Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2024-01-10 19:38:51
Message-ID: 20240110193851.f0.nmisch@google.com
Lists: pgsql-bugs

On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
>
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know. That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
>
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of
"GetOldestNonRemovableTransactionId(NULL)". There's just one other caller,
vac_update_datfrozenxid(), which calls it for an unrelated reason.
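
To make the suspected failure mode from the quoted text concrete, here is a toy
model of it. This is not PostgreSQL code: `prune`, `scan_page`, and the
`max_retries` cap are hypothetical names invented for illustration, xids are
plain integers with no wraparound, and a page is just a list of deleting xids.
It only sketches the idea that pruning decides removability with one cutoff
(standing in for the GlobalVis* state) while the follow-up check uses another
(standing in for VACUUM's OldestXmin), so a tuple that falls between the two is
never pruned yet always looks dead, and the retry never converges.

```python
def prune(page, prune_horizon):
    """Drop tuples whose deleting xid is older than the pruning cutoff.

    `page` is a list of deleting xids; None marks a live tuple.
    """
    return [xmax for xmax in page
            if xmax is None or xmax >= prune_horizon]


def scan_page(page, oldest_xmin, prune_horizon, max_retries=5):
    """Prune, then re-check with a second cutoff, retrying on disagreement.

    Returns the number of retries needed, or None if the cap was hit.
    The cap exists only so this toy terminates; it is an assumption of
    the sketch, not something taken from the thread.
    """
    for attempt in range(max_retries):
        page = prune(page, prune_horizon)
        # Second opinion: does any surviving tuple look dead under oldest_xmin?
        still_dead = any(xmax is not None and xmax < oldest_xmin
                         for xmax in page)
        if not still_dead:
            return attempt
        # Otherwise retry, on the assumption that pruning will now remove it.
    return None


page = [None, 90, 120]  # one live tuple, two deleted by xids 90 and 120

# Cutoffs agree: xid 90 is pruned and the check passes on the first try.
agree = scan_page(page, oldest_xmin=100, prune_horizon=100)

# Pruning cutoff lags behind: xid 90 survives pruning but still looks dead.
disagree = scan_page(page, oldest_xmin=100, prune_horizon=80)
```

With matching cutoffs the scan finishes without a retry. When the pruning
cutoff is older than the checking cutoff, every retry sees the same page, which
is the shape of hang discussed in this thread.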
