Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Noah Misch <noah(at)leadboat(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2023-12-25 02:44:02
Message-ID: 20231225024402.77@rfd.leadboat.com
Lists: pgsql-bugs

On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> Another alternative would be to replace the use of vacrel->OldestXmin
> with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe

v17 commit 1ccc1e05ae essentially did that.

> that is not legal in how vacuum works (we cannot unilaterally decide
> that we want to retain tuples < OldestXmin).

Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
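
To make the disagreement above concrete, here is a toy C model. It is a
sketch under stated assumptions, not PostgreSQL code: the names ToyXid,
toy_prune_removes(), toy_htsv_dead() and toy_scan_retries() are hypothetical,
XIDs are compared as plain integers (no wraparound handling), and each rule
is reduced to a single horizon comparison. It only shows how a tuple whose
deleting XID falls between a maybe_needed-style horizon and OldestXmin is
retained by the pruning-style rule yet judged dead by the recheck, so a
retry loop never makes progress.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy XID type; real TransactionId comparisons handle wraparound. */
typedef uint32_t ToyXid;

/* Pruning-style rule (assumption): removable only if the deleting XID
 * precedes the maybe_needed-style horizon. */
static bool
toy_prune_removes(ToyXid xmax, ToyXid maybe_needed)
{
    return xmax < maybe_needed;
}

/* Recheck-style rule (assumption): dead if the deleting XID precedes
 * OldestXmin. */
static bool
toy_htsv_dead(ToyXid xmax, ToyXid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * Model of a prune-then-recheck loop for one deleted tuple.  Returns the
 * number of retries performed before the two rules agreed, or -1 if
 * max_retries was reached without agreement.
 */
static int
toy_scan_retries(ToyXid xmax, ToyXid maybe_needed, ToyXid oldest_xmin,
                 int max_retries)
{
    int retries = 0;

    for (;;)
    {
        if (toy_prune_removes(xmax, maybe_needed))
            return retries;     /* tuple pruned; nothing left to recheck */

        if (!toy_htsv_dead(xmax, oldest_xmin))
            return retries;     /* both rules agree: retain the tuple */

        /* Retained by pruning, yet dead per the recheck: retry. */
        if (++retries >= max_retries)
            return -1;
    }
}
```

In this model, toy_scan_retries(100, 90, 110, 5) returns -1 because the
horizons never change between iterations, while equal horizons, or a deleting
XID at or above both horizons, return 0. Using one rule for both decisions
removes the disagreement by construction.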

On Wed, Nov 10, 2021 at 12:28:43PM -0800, Peter Geoghegan wrote:
> On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> > I can propose the debugging patch to reproduce the issue that replaces
> > the hang with the assert and modifies a pair of crash-causing test
> > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > down the scripts further as I have to leave for a week.)
> >
> > The reproducing script is:
>
> I cannot reproduce this bug by following your steps, even when the
> assertion is made to fail after only 5 retries (5 is still ludicrously
> excessive, 100 might be overkill). And even when I don't use a debug
> build (and make the assertion into an equivalent PANIC). I wonder why
> that is. I didn't have much trouble following your similar repro for
> bug #17255.

For what it's worth, I needed "-X" on the script's psql command lines to keep
my ~/.psqlrc from harming things. I also wondered if the regression database
needed to be populated with a "make installcheck" run. The script had a
"createdb regression" without a "make installcheck", so I assumed an empty
regression database was intended.

> My immediate goal in trying to follow your reproducer was to determine
> what effect (if any) the pending bugfix for #17255 [1] has on this
> bug. It seems more than possible that this bug is in fact a different
> manifestation of the same underlying problem we see in #17255. And so
> that should be the next thing we check here.
>
> [1] https://postgr.es/m/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com

Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:

- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.

- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.

- At 18b87b201f, a 1440m script run got no failures.

I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.

Thanks,
nm
