Re: improving wraparound behavior

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: improving wraparound behavior
Date: 2019-05-04 02:03:18
Message-ID: 20190504020318.GO6197@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> I am not sure exactly how to fix this,
> because the calculation we use to determine the XID that can be used
> to vacuum a specific table is pretty complex; how can the postmaster
> know whether it's going to be able to make any progress in *any* table
> in some database to which it's not even connected? But it's surely
> crazy to just keep doing something over and over that can't possibly
> work.

I definitely agree that it's foolish to keep doing something that isn't
going to work, and it seems like a pretty large part of the issue here
is that we don't have enough information to be more intelligent because
we aren't connected to the database that needs the work to be done.

Now, presuming we're talking about 'new feature work' here to try and
address this, and not something that we think we can back-patch, I had
another thought.

I certainly get that having lots of extra processes around can be a
waste of resources... but I don't recall a lot of people complaining
about the autovacuum launcher process using up lots of resources
unnecessairly.

Perhaps we should consider having the 'global' autovacuum launcher, when
it's decided that a database needs work to be done, launch a 'per-DB'
launcher which manages launching autovacuum processes for that database?
If the launcher is still running then it's because there's still work to
be done on that database and the 'global' autovacuum launcher can skip
it. If the 'per-DB' launcher runs out of things to do, and the database
it's working on is no longer in a danger zone, then it exits.

There are certainly some other variations on this idea and I don't know
that it's really better than keeping more information in shared
memory or something else, but it seems like part of the issue is that
the thing firing off the processes hasn't got enough info to do so
intelligently and maybe we could fix that by having per-DB launchers
that are actually connected to a DB.

> 2. Once you get to the point where you start to emit errors when
> attempting to assign an XID, you can still run plain old VACUUM
> because it doesn't consume an XID ... except that if it tries to
> truncate the relation, then it will take AccessExclusiveLock, which
> has to be logged, which forces an XID assignment, which makes VACUUM
> fail. So if you burn through XIDs until the system gets to this
> point, and then you roll back the prepared transaction that caused the
> problem in the first place, autovacuum sits there trying to vacuum
> tables in a tight loop and fails over and over again as soon as hits a
> table that it thinks needs to be truncated. This seems really lame,
> and also easily fixed.
>
> Attached is a patch that disables vacuum truncation if xidWarnLimit
> has been reached. With this patch, in my testing, autovacuum is able
> to recover the system once the prepared transaction has been rolled
> back. Without this patch, not only does that not happen, but if you
> had a database with enough relations that need truncation, you could
> conceivably cause XID wraparound just from running a database-wide
> VACUUM, the one tool you have available to avoid XID wraparound. I
> think that this amounts to a back-patchable bug fix.

Sounds reasonable to me but I've not looked at the patch at all.

> 3. The message you get when you hit xidStopLimit seems like bad advice to me:
>
> ERROR: database is not accepting commands to avoid wraparound data
> loss in database "%s"
> HINT: Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions,
> or drop stale replication slots.
>
> Why do we want people to stop the postmaster and vacuum that database
> in single user mode? Why not just run VACUUM in multi-user mode, or
> let autovacuum take care of the problem? Granted, if VACUUM is going
> to fail in multi-user mode, and if switching to single-user mode is
> going to make it succeed, then it's a good suggestion. But it seems
> that it doesn't fail in multi-user mode, unless it tries to truncate
> something, which is a bug we should fix. Telling people to go to
> single-user mode where they can continue to assign XIDs even though
> they have almost no XIDs left seems extremely dangerous, actually.
>
> Also, I think that old prepared transactions and stale replication
> slots should be emphasized more prominently. Maybe something like:
>
> HINT: Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.

I agree that a better message would definitely be good and that
recommending single-user isn't a terribly useful thing to do. I might
have misunderstood it, but it sounded like Andres was proposing a new
function which would basically tell you what's holding back the xid
horizon and that sounds fantastic and would be great to include in this
message, if possible.

As in:

HINT: Run the function pg_what_is_holding_xmin_back() to identify what
is preventing autovacuum from progressing and address it.

Or some such.

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-05-04 02:06:20 Re: improving wraparound behavior
Previous Message David Fetter 2019-05-04 01:41:47 Re: range_agg