Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2021-11-06 01:20:14
Message-ID: CAEze2WjJPVoWPWGaqi=XX6hR-ZWN2A1bw_1DtD8T-y_v6EU6Lg@mail.gmail.com
Lists: pgsql-bugs

On Fri, 5 Nov 2021 at 22:25, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > I added the attached instrumentation for checking xmin validity, which
> > asserts what I believe are correct claims about the proc
> > infrastructure:
>
> This test case involves partitioning, but also pruning, which is very
> particular about heap tuple headers being a certain way following
> updates. I wonder if we're missing a
> HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
> heapam/VACUUM/pruning code, or could be somewhere else.

If you look closely, the second backtrace in [0] (the segfault)
originates from the code that builds the partition bounds based on
relcaches / catalog tables, which are never partitioned. Although the
crash is indeed in the partition infrastructure, if we had a tuple
with HeapTupleHeaderIndicatesMovedPartitions() set at that point, then
that would itself be a bug (we do not partition catalogs).

But I hit this same segfault earlier while testing, and I deduced that
the problem at that point was that an index entry could not be
resolved to a heap tuple (or the scan at partdesc.c:227 otherwise
returned NULL where one result was expected); so the tuple is NULL at
partdesc.c:230, and heap_getattr subsequently segfaults when it
dereferences the null tuple pointer to access its fields.
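
To illustrate the failure mode described above, here is a minimal
standalone sketch. It is not the partdesc.c code: ToyTuple,
toy_scan_getnext and toy_get_bound are hypothetical names invented for
this example. It only shows how a scan that finds no tuple hands a NULL
pointer to a caller that expected exactly one result, and how a check
turns the crash into a reportable error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap tuple; not PostgreSQL's HeapTupleData. */
typedef struct ToyTuple
{
    int relid;
    int boundval;
} ToyTuple;

/*
 * Hypothetical stand-in for a scan's "get next" call: returns the
 * matching tuple, or NULL when nothing (visible) matches.
 */
static ToyTuple *
toy_scan_getnext(ToyTuple *tuples, size_t ntuples, int relid)
{
    for (size_t i = 0; i < ntuples; i++)
    {
        if (tuples[i].relid == relid)
            return &tuples[i];
    }
    return NULL;
}

/*
 * Fetch the bound for relid. Returns 1 and sets *bound on success, or 0
 * if the scan found nothing. Without the NULL check, reading
 * tuple->boundval would dereference a null pointer, which is the kind
 * of crash described above.
 */
static int
toy_get_bound(ToyTuple *tuples, size_t ntuples, int relid, int *bound)
{
    ToyTuple *tuple;

    assert(bound != NULL);
    tuple = toy_scan_getnext(tuples, ntuples, relid);
    if (tuple == NULL)
        return 0;
    *bound = tuple->boundval;
    return 1;
}
```

In this sketch the caller decides what to do with a 0 return; whether
the real code should raise an error at that point is a separate
question from why the tuple went missing.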

Due to the blatant visibility horizon confusion, the failing scan
being on the pg_class table, and the test case including aggressive
manual vacuuming of the pg_class table, I assume that the error was
caused by vacuum having removed tuples from pg_class, while other
backends still required / expected access to these tuples.
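
As a rough model of that assumption (a simplified sketch, not
PostgreSQL's actual horizon computation): a dead tuple may only be
removed when the transaction that deleted it is older than the oldest
xmin any backend still holds. ToyXid, toy_removal_horizon and
toy_is_removable are hypothetical names, and xid wraparound is
deliberately ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transaction id type; plain comparison, no wraparound. */
typedef uint32_t ToyXid;

/*
 * Removal horizon: the oldest xmin among all backends, or next_xid if
 * no backend holds a snapshot.
 */
static ToyXid
toy_removal_horizon(const ToyXid *backend_xmins, size_t n, ToyXid next_xid)
{
    ToyXid horizon = next_xid;

    assert(n == 0 || backend_xmins != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (backend_xmins[i] < horizon)
            horizon = backend_xmins[i];
    }
    return horizon;
}

/*
 * A tuple deleted by deleter_xid is removable only if every backend's
 * snapshot already treats that deletion as finished, i.e. the deleter
 * precedes the horizon.
 */
static int
toy_is_removable(ToyXid deleter_xid, ToyXid horizon)
{
    return deleter_xid < horizon;
}
```

With backend xmins of 100 and 105, the horizon is 100 and a tuple
deleted by xid 102 must be kept. If vacuum instead worked with a
horizon that was too new (say 110), it would remove that tuple even
though the backend at xmin 100 could still need it.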

Kind regards,

Matthias

[0] https://www.postgresql.org/message-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3%40gmail.com
