Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2024-01-09 23:16:01
Message-ID: CAH2-Wzn57T=d7eB90m0wr+AiAXetk-NWA=ntS89R2mOcDimNsQ@mail.gmail.com
Lists: pgsql-bugs

On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.

My suspicion was that this factor might increase the propensity of
calls to GetOldestNonRemovableTransactionId (used to establish
VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
by pruneheap.c, in the way that we need to worry about here (i.e.
inconsistencies that lead to VACUUM getting stuck inside
lazy_scan_prune's loop).

Using gdb I was able to determine that
ComputeXidHorizonsResultLastXmin == RecentXmin at some point long
after the system got stuck (when I actually looked). So
GlobalVisTestShouldUpdate() doesn't return true at that point. And, I
see that VACUUM's OldestXmin value is between
GlobalVisDataRels.maybe_needed and
GlobalVisDataRels.definitely_needed. The deleted tuple's xmax is
committed according to OldestXmin (i.e. it's a value < OldestXmin),
and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax
is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was
already frozen by a previous VACUUM operation. The tuple infomask
flags indicate that it's a pretty standard deleted tuple.
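
To make the relationships above concrete, here is a toy model in
Python (not PostgreSQL code). XIDs are plain integers, wraparound is
ignored, and every numeric value is invented for illustration. The two
functions only paraphrase the comparisons described above: the
OldestXmin test that VACUUM applies, and the GlobalVis* test that
pruning applies when no horizon recomputation takes place.

```python
# Toy model only: plain integer XIDs, no wraparound, made-up values.
# Function names are hypothetical, not PostgreSQL identifiers.

def dead_per_oldest_xmin(xmax, oldest_xmin):
    """VACUUM's view: a committed deleter older than OldestXmin is dead."""
    return xmax < oldest_xmin


def removable_per_globalvis(xmax, maybe_needed, definitely_needed):
    """Pruning's view, assuming the horizons are NOT recomputed
    (the situation seen in the debugger)."""
    if xmax < maybe_needed:
        return True      # older than everything anyone might need
    if xmax >= definitely_needed:
        return False     # someone certainly still needs it
    # Between the two bounds: without a recomputation, be conservative.
    return False


def vacuum_would_retry(xmax, oldest_xmin, maybe_needed, definitely_needed):
    """True when VACUUM calls the tuple dead but pruning will not remove it."""
    return (dead_per_oldest_xmin(xmax, oldest_xmin)
            and not removable_per_globalvis(xmax, maybe_needed,
                                            definitely_needed))


# Invented values that satisfy the orderings described above:
#   maybe_needed < xmax < OldestXmin < definitely_needed
MAYBE_NEEDED = 1000
XMAX = 1040
OLDEST_XMIN = 1050
DEFINITELY_NEEDED = 1100

if __name__ == "__main__":
    print(vacuum_would_retry(XMAX, OLDEST_XMIN,
                             MAYBE_NEEDED, DEFINITELY_NEEDED))
```

With these values the first test says the tuple is dead while the
second says it cannot be pruned. Repeating the prune changes nothing
as long as the horizons stay the same, which is the kind of
disagreement that would keep lazy_scan_prune retrying.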

Overall, few of these details seem out of the ordinary in a way that
would hint at a specific underlying cause.

It looks more like the assumptions that we make about OldestXmin
agreeing with GlobalVis* state just aren't quite robust, in general.
Ideally I'd be able to point to some specific assumption that has been
violated -- and we might yet tie the problem to some specific detail
that I've yet to identify. As I said upthread, I'm concerned that code
in places like procarray.c is rather loose about how the horizons are
recomputed, in a way that doesn't sit well with me.
GlobalVisTestShouldUpdate() thinks that it's okay to use
ComputeXidHorizonsResultLastXmin-based heuristics to decide when to
recompute horizons. It is more or less treated as a matter of weighing
costs against benefits -- not as a potential correctness issue.
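
As a rough illustration of that heuristic, the following Python sketch
paraphrases the decision as it reads in procarray.c on the 14/15
branches. It is a simplified model under stated assumptions, not the
actual implementation: XIDs are plain integers, 0 stands in for an
invalid XID, wraparound is ignored, and the function and parameter
names are hypothetical.

```python
# Paraphrased model of the "should the horizons be recomputed?" heuristic.
# Assumptions: integer XIDs, 0 means "invalid", no wraparound handling.

INVALID_XID = 0


def should_update(last_xmin, recent_xmin, maybe_needed, definitely_needed):
    """Return True when a recomputation is judged worth its cost.

    last_xmin   models ComputeXidHorizonsResultLastXmin
    recent_xmin models RecentXmin
    """
    # Horizons have never been computed: must compute them.
    if last_xmin == INVALID_XID:
        return True
    # Bounds already coincide: a refresh is unlikely to help.
    if maybe_needed >= definitely_needed:
        return False
    # Otherwise refresh only if the snapshot xmin has moved since
    # the last computation.
    return recent_xmin != last_xmin
```

Note that nothing in this decision looks at VACUUM's OldestXmin. When
last_xmin equals recent_xmin, as in the state observed with gdb, the
model answers False even though maybe_needed is still behind
OldestXmin.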

--
Peter Geoghegan
