Re: Corrupt index stopping autovacuum system wide

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Aaron Pelz <aaronepelz(at)gmail(dot)com>, pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Corrupt index stopping autovacuum system wide
Date: 2019-07-17 22:14:21
Message-ID: 20190717221421.muxfrw2o2iai4xk6@alap3.anarazel.de
Lists: pgsql-general

Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That's a pretty common problem in the real world, with or without problems
like corrupted indexes. Autovacuum's scheduling is just too simplistic
to avoid it.

Common problems:

- If one table in one database has an xmin older than
autovacuum_freeze_max_age, the autovacuum launcher continuously throws
more workers at that database, even though there's a lot of work to be
done in other databases.

That's because do_start_worker() ignores the "adl_next_worker"
mechanism, and *always* starts more workers for the database with the
oldest datfrozenxid (same with datminmxid), and because we explicitly
don't start workers for other databases ("/* ignore not-at-risk DBs
*/").

That's especially bad if that database has a long-running transaction
preventing cleanup from happening - the vacuuming will not be able to
advance the horizon, and all other databases are starved.

- If there's a table "early in pg_class", and vacuuming it fails, that
will prevent vacuuming of other tables. It's pretty common for vacuum to
fail often on a table, e.g. because it's sometimes exclusively locked,
which then causes autovacuum to kill itself. There's absolutely no
mechanism for autovacuum workers to skip over that table for a while,
leading to all other tables in the database not being vacuumed, unless
there happens to be a second worker in the database while the first
vacuum hasn't yet failed.

This obviously also applies to the corrupted index case.

The 'kill itself' behaviour is exacerbated by lazy_truncate_heap()'s
exclusive lock - which will obviously trigger other backends to send
cancellation requests. There's unfortunately a number of places where
that leads us to just throw all the work done away, and not update
pg_class.relfrozenxid/datfrozenxid.
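A toy simulation of that failure mode (again my own sketch with invented names, not the real worker loop):

```python
# A single worker walks tables in pg_class order; an error (or a
# cancellation request) on an early table aborts the whole worker, so
# later tables never get vacuumed - there's no per-table retry backoff.

def run_worker(tables, vacuum):
    """tables: ordered as in pg_class; vacuum(t) raises on failure."""
    done = []
    for t in tables:
        vacuum(t)       # an exception here kills the worker outright;
        done.append(t)  # there is no "skip this table for a while" list
    return done

def vacuum(table):
    if table == "locked_table":
        # e.g. a lock conflict, or the corrupted-index case
        raise RuntimeError("canceling autovacuum task")

try:
    run_worker(["locked_table", "bloated_a", "bloated_b"], vacuum)
except RuntimeError:
    pass  # bloated_a / bloated_b were never reached
```

The next worker launched for the database starts over at the same table and hits the same error, so the starvation repeats indefinitely.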

- Anti-wraparound vacuums are more impactful (they don't cancel
themselves upon lock conflicts, cause more IO, and wait for cleanup
locks), and often emit scary messages ("oldest xmin is far in the
past"). But we don't have *any* mechanism that avoids them for very
common scenarios.

E.g. for insert-mostly workloads, there'll never be enough dead tuples
to cause a vacuum to happen before autovacuum_freeze_max_age is
reached. That prevents vacuum_freeze_table_age from pre-empting the
need to do an anti-wraparound vacuum, by increasing the xid horizon.

We literally call anti-wraparound autovacuums "emergency" in the code,
yet they're an almost inevitable part of running postgres.
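To make the insert-mostly case concrete, here's the documented trigger condition with the default GUC values (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2):

```python
# The autovacuum trigger condition, per the documented formula:
#   n_dead_tuples > autovacuum_vacuum_threshold
#                   + autovacuum_vacuum_scale_factor * reltuples

def needs_vacuum(n_dead, reltuples, threshold=50, scale_factor=0.2):
    return n_dead > threshold + scale_factor * reltuples

# Insert-only table: rows keep arriving but nothing is deleted or
# updated, so n_dead stays ~0 and the condition never fires...
print(needs_vacuum(n_dead=0, reltuples=10_000_000))  # -> False
# ...until age(relfrozenxid) crosses autovacuum_freeze_max_age and a
# full anti-wraparound vacuum is forced instead.
```

Since no ordinary vacuum ever runs on such a table, vacuum_freeze_table_age never gets a chance to turn one into an aggressive, horizon-advancing vacuum.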

- There's no meaningful in-database prioritization. Which means that
autovacuum workers might happily vacuum a table that's just a bit over
the thresholds, even though there are much worse tables around.

Especially on busy, large databases that can lead to workers launched
for anti-wraparound purposes effectively never getting to the tables
above autovacuum_freeze_max_age, because tables earlier in pg_class
are modified heavily enough that they again have dead tuples above the
thresholds by the time a vacuum finishes. To reach the table needing
the anti-wraparound vacuum, a single worker needs to go through all
tables preceding it in pg_class that need vacuuming (skipping only
ones that are concurrently vacuumed by somebody else, but not ones
that have *recently* been vacuumed).

I kinda forgot how bad this one was until looking at the code again.

And there are plenty more.

My impression is that these are really hard to fix unless we develop a
new scheduling approach. And that scheduling approach probably needs to
be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- work in different databases needs to be scheduled in a smarter way; in
particular, anti-wraparound vacuuming cannot simply cause only the
database with the oldest datfrozenxid to be vacuumed until the
wraparound "danger" is over (especially not if there are other
databases needing to be anti-wraparound vacuumed)

- tables within a database need to be prioritized in a smarter way, so
tables with a lot of bloat get vacuumed before ones with much less
bloat, and similarly tables with the oldest horizon need to be
vacuumed before ones with newer horizons, even if all of the tables
are above the autovacuum thresholds.

- tables need to be prioritized across databases, to avoid problems like
one currently vacuumed table causing unnecessary anti-wraparound
workers to be launched for a database where they can't perform any
work.

- there needs to be a separate rung between normal autovacuums and
anti-wraparound vacuums. The in-between level does 'aggressive'
vacuuming (so it freezes), even for tables with just inserts, but
remains interruptible, and doesn't have a scary 'anti-wraparound' name.

- repeated failures on one table must not starve all work for other
tables in a database.

- errors vacuuming one table must not prevent vacuuming of
lower-priority tables until that error is fixed.

To me that means we need prioritization across databases and between
tables, and probably by multiple criteria. I suspect there need to be
multiple criteria for how urgent vacuuming is, and autovacuum ought to
try to make progress on all of them.

I suspect the way to implement that would be to have avlauncher start
workers for all databases independent of tasks needing to be done (but
at a slower schedule than autovacuum_naptime). Those workers then would
put the tables needing vacuuming in the database into the global
prioritized list, displacing entries that are less important once the
list is full. Proper autovacuuming work would then be done strictly in
priority order.
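One way that bounded, displacing list could look - purely a sketch of the data structure, with invented names, not code from any patch:

```python
# A fixed-capacity priority list: workers offer (priority, table)
# entries; once full, a more important entry displaces the least
# important one. A min-heap keeps the displacement candidate at the root.

import heapq

class BoundedPriorityList:
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # min-heap: root is the least important entry

    def offer(self, priority, table):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (priority, table))
        elif priority > self.heap[0][0]:
            # displace the least important entry already in the list
            heapq.heapreplace(self.heap, (priority, table))

    def drain_most_urgent_first(self):
        return [t for _, t in sorted(self.heap, reverse=True)]

q = BoundedPriorityList(capacity=2)
q.offer(10, "mildly_bloated")
q.offer(90, "near_wraparound")
q.offer(40, "quite_bloated")   # displaces the priority-10 entry
print(q.drain_most_urgent_first())  # -> ['near_wraparound', 'quite_bloated']
```

The real thing would live in shared memory and need locking, but the displacement behaviour is the point: a full list never blocks a genuinely urgent table from entering.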

To keep one priority criterion (e.g. age(relfrozenxid)) from preventing
any work done on other criteria (e.g. number of dead tuples), I suspect
we ought to round-robin between the different criteria. E.g. have one
list of tables needing vacuum ordered by age(relfrozenxid), one by
%dead-tuples, and one by last-failure time, and have workers try to make
progress on all of them.
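A minimal sketch of that round-robin selection (my interpretation; the names are invented):

```python
# Round-robin across several priority-ordered lists, one per criterion,
# so no single criterion can monopolize all the worker slots.

from itertools import cycle

def schedule(by_age, by_dead_frac, by_last_failure, n_slots):
    lists = [list(by_age), list(by_dead_frac), list(by_last_failure)]
    picked, seen = [], set()
    for i in cycle(range(len(lists))):
        if len(picked) >= n_slots or not any(lists):
            break
        if lists[i]:
            t = lists[i].pop(0)
            if t not in seen:  # a table may appear under several criteria
                seen.add(t)
                picked.append(t)
    return picked

print(schedule(by_age=["old_a", "old_b"],
               by_dead_frac=["bloated_x", "old_a"],
               by_last_failure=["flaky_t"],
               n_slots=3))
# -> ['old_a', 'bloated_x', 'flaky_t']
```

Even with many wraparound-threatened tables, one slot per round still goes to the most bloated table and one to retrying a previously failed one.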

If that sounds familiar to what is being argued about around UNDO
processing in workers, that'd not be an accident.

A lesser, but still pretty pressing, concern is that we currently have
no system that keeps the visibility map up to date for insert-mostly
tables. That means that a) index-only scans aren't going to happen for
such tables unless they're manually vacuumed, and b) the anti-wraparound
vacuum that eventually does happen is going to be extremely
expensive/slow.

Regards,

Andres
