From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date: | 2024-01-09 23:16:01 |
Message-ID: | CAH2-Wzn57T=d7eB90m0wr+AiAXetk-NWA=ntS89R2mOcDimNsQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The system 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.
My suspicion was that this factor might increase the propensity of
calls to GetOldestNonRemovableTransactionId (used to establish
VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
by pruneheap.c, in the way that we need to worry about here (i.e.
inconsistencies that lead to VACUUM getting stuck inside
lazy_scan_prune's loop).
Using gdb I was able to determine that
ComputeXidHorizonsResultLastXmin == RecentXmin at some point long
after the system gets stuck (when I actually looked). So
GlobalVisTestShouldUpdate() doesn't return true at that point. And, I
see that VACUUM's OldestXmin value is between
GlobalVisDataRels.maybe_needed and
GlobalVisDataRels.definitely_needed. The deleted tuple's xmax is
committed according to OldestXmin (i.e. it's a value < OldestXmin),
and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax
is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was
already frozen by a previous VACUUM operation. The tuple infomask
flags indicate that it's a pretty standard deleted tuple.
Overall, there aren't a lot of details here that seem like they might
be out of the ordinary, hinting at a specific underlying cause.
It looks more like the assumptions that we make about OldestXmin
agreeing with GlobalVis* state just aren't quite robust, in general.
Ideally I'd be able to point to some specific assumption that has been
violated -- and we might yet tie the problem to some specific detail
that I've yet to identify. As I said upthread, I'm concerned that code
in places like procarray.c is rather loose about how the horizons are
recomputed, in a way that doesn't sit well with me.
GlobalVisTestShouldUpdate() thinks that it's okay to use
ComputeXidHorizonsResultLastXmin-based heuristics to decide when to
recompute horizons. It is more or less treated as a matter of weighing
costs against benefits -- not as a potential correctness issue.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Richard Guo | 2024-01-10 07:44:13 | Re: BUG #18252: Assert in CheckOpSlotCompatibility() fails when recursive union filters tuples in non-recursive term |
Previous Message | Noah Misch | 2024-01-09 21:44:47 | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |