Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date:	2023-12-31 23:53:34
Message-ID:	CAH2-Wz=X88igoxQvAGW2cFrTsEXK+VJJHap=90-z=juD5DqYdw@mail.gmail.com
Lists:	pgsql-bugs

On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> > Another alternative would be to replace the use of vacrel->OldestXmin
> > with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
>
> v17 commit 1ccc1e05ae essentially did that.

Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.

> > that is not legal in how vacuum works (we cannot unilaterally decide
> > that we want to retain tuples < OldestXmin).
>
> Do you think commit 1ccc1e05ae creates problems in that respect? It does have
> the effect of retaining tuples for which GlobalVisState rules "retain" but
> HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
> problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
> infinite-retries scenarios, if any.

My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but a guess isn't much use. I really don't know either way
right now, and I wouldn't like to speculate much further before
gaining a proper understanding of what's going on here. The problem seems
to be specific to partitioning with cross-partition updates.

> Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
> procedure, I see these results:
>
> - A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
> "numretries" assertion failure in each of five 10m runs.
>
> - At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
> script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
> File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
> was harder to reach than the numretries failure.
>
> - At 18b87b201f, a 1440m script run got no failures.
>
> I've seen symptoms that suggest the infinite-numretries bug remains reachable,
> but I don't know how to reproduce them. (Given the upthread notes about xmin
> going backward during end-of-xact, I'd first try some pauses there.) If it
> does remain reachable, likely some other code change between 2021-10 and
> 2022-01 made this particular test script no longer reach it.

I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?

I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.

--
Peter Geoghegan

In response to

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() at 2023-12-25 02:44:02 from Noah Misch

Responses

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() at 2024-01-06 20:24:13 from Noah Misch
