Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2024-01-10 19:06:42
Message-ID: CAH2-Wznv94Q_Td8OS8bAN7fYLpfU6CGgjn6Xau5eJ_sDxEGeBA@mail.gmail.com
Lists: pgsql-bugs

On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > Did the affected system that you investigated happen to have an
> > > atypically high number of databases? The 15.4 system that I saw
> > > the problem on had almost 3,000 databases.
> >
> > No, single-digit database count here.
>
> My suspicion was that this factor might increase the propensity of
> calls to GetOldestNonRemovableTransactionId (used to establish
> VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> by pruneheap.c, in the way that we need to worry about here (i.e.
> inconsistencies that lead to VACUUM getting stuck inside
> lazy_scan_prune's loop).

Another question about your database/system: does VACUUM get stuck
while scanning a page some time after it has already completed a round
of index vacuuming? And if so, does an nbtree bulk delete end up
deleting and then recycling many index leaf pages (e.g., due to bulk
range deletions)?

That's what I see here -- I don't think that pruning leaves behind
even a single live heap tuple, despite scanning thousands of pages
before reaching the page that it gets stuck on. Could be another red
herring. But it doesn't seem impossible that some of the nbtree calls
to procarray.c routines performed by code added by my commit
9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
somehow related. That is, that code could be part of the chain of
events that cause the problem (whether or not the code itself is
technically at fault).

I'm referring to calls such as the
"GetOldestNonRemovableTransactionId(NULL)" and
"GlobalVisCheckRemovableFullXid()" calls that take place inside
_bt_pendingfsm_finalize(). It's not like we do stuff like that in very
many other places.
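
To make the suspected failure mode concrete, here is a toy model (not PostgreSQL code) of the disagreement described in the quoted text: pruning decides removability against one horizon, while the check made afterwards uses VACUUM's OldestXmin. All names below (ToyXid, toy_prune_can_remove, toy_vacuum_sees_dead, toy_scan_page) are hypothetical, XIDs are compared as plain integers with no wraparound handling, and the retry cap exists only so that the sketch terminates; as far as I recall, the real loop in lazy_scan_prune has no such cap.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;        /* hypothetical; ignores XID wraparound */

/* Hypothetical stand-in for the GlobalVis*-based test used while pruning. */
static bool
toy_prune_can_remove(ToyXid xmax, ToyXid prune_horizon)
{
    return xmax < prune_horizon;
}

/* Hypothetical stand-in for the OldestXmin-based check made after pruning. */
static bool
toy_vacuum_sees_dead(ToyXid xmax, ToyXid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * Process a "page" holding a single deleted tuple whose deleter is xmax.
 * Returns the number of passes needed, or -1 if max_retries is exhausted
 * (the toy equivalent of VACUUM getting stuck on the page).
 */
static int
toy_scan_page(ToyXid xmax, ToyXid prune_horizon, ToyXid oldest_xmin,
              int max_retries)
{
    bool tuple_present = true;

    for (int pass = 1; pass <= max_retries; pass++)
    {
        /* Pruning step: removes the tuple only if its own horizon allows. */
        if (tuple_present && toy_prune_can_remove(xmax, prune_horizon))
            tuple_present = false;

        /* Post-prune step: done unless a surviving tuple still looks dead. */
        if (!tuple_present || !toy_vacuum_sees_dead(xmax, oldest_xmin))
            return pass;

        /* The tuple survived pruning but looks dead here: retry the page. */
    }

    return -1;
}
```

With agreeing horizons, toy_scan_page(100, 150, 150, 1000) returns 1. If the pruning horizon lags behind OldestXmin, as in toy_scan_page(100, 90, 150, 1000), every pass reaches the same conclusion and the function returns -1.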

--
Peter Geoghegan
