Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-13 15:06:40
Message-ID: 20211113150640.vk5zhjangylufxaa@localhost
Lists: pgsql-bugs

> On Fri, Nov 12, 2021 at 02:46:22PM -0800, Peter Geoghegan wrote:
> On Fri, Nov 12, 2021 at 2:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Naturally, I also went through the exercise of trying to find a
> > > counterexample, where pruning doesn't see a disconnected tuple as DEAD
> > > in its HTSV. I could not get the assertion to fail with Alexander's
> > > test case, nor with make check-world.
> >
> > I don't think that provides a meaningful coverage. Alexander's test has a
> > quite limited set operations (which e.g. doesn't include an subxacts), and our
> > own tests around subtransactions, and particularly concurrent subtransaction
> > heavy work, is quite, uh, minimal.
>
> It's a start.
>
> We need to be pragmatic here. There is some uncertainty about what
> HTSV might say about a disconnected tuple in the absence of
> corruption, or there is a risk of a new problem like that coming up in
> the future -- let's work within those confines, then. What do you want
> to do about that? There aren't that many choices, since, to repeat,
> the tuple is "morally" DEAD no matter what. Even with corruption, even
> without corruption in the presence of some unanticipated corner case
> with HTSV -- this is fundamental.

I got curious whether modifying Alexander's test case could reveal
something interesting, so I sprinkled it with savepoints and rollbacks.
Almost immediately a new problem manifested itself, although the crash
has nothing to do with disconnected tuples as far as I can tell. It is
still probably worth mentioning. In this case vacuum invoked
lazy_scan_prune, and during the first scan one of the chains had a
HEAPTUPLE_DEAD tuple at the third position. The processing flow fell
through to heap_prune_record_prunable and crashed on an assertion
because the xid passed in was InvalidTransactionId:

#3 0x000055a2b260d1f9 in heap_prune_record_prunable (prstate=0x7ffd0c0ecdf0, xid=0) at pruneheap.c:872
#4 0x000055a2b260ca72 in heap_prune_chain (buffer=2117, rootoffnum=150, prstate=0x7ffd0c0ecdf0) at pruneheap.c:695
#5 0x000055a2b260bcd6 in heap_page_prune (relation=0x7fb98e217e20, buffer=2117, vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=false, off_loc=0x55a2b3e6a0cc) at pruneheap.c:288
#6 0x000055a2b261309c in lazy_scan_prune (vacrel=0x55a2b3e6a060, buf=2117, blkno=192, page=0x7fb97856bf80 "", vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, prunestate=0x7ffd0c0ee9d0) at vacuumlazy.c:1739

Calling heap_prune_record_prunable only when TransactionIdIsNormal holds
for the xid seems to help. The original implementation didn't reach
heap_prune_record_prunable in this case either, and it doesn't crash.
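
To illustrate the guard described above, here is a minimal standalone
sketch. It is not the actual change to pruneheap.c. The XID macros
mirror the definitions in PostgreSQL's access/transam.h, but
SketchPruneState, sketch_xid_precedes, sketch_record_prunable and
sketch_record_prunable_if_normal are hypothetical stand-ins invented
for this example, and the wraparound comparison is simplified to
normal XIDs only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Mirrors the definitions in access/transam.h. */
#define InvalidTransactionId       ((TransactionId) 0)
#define FirstNormalTransactionId   ((TransactionId) 3)
#define TransactionIdIsValid(xid)  ((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/* Hypothetical, stripped-down stand-in for PruneState. */
typedef struct SketchPruneState
{
    TransactionId new_prune_xid;    /* oldest prunable xid seen so far */
} SketchPruneState;

/* Simplified wraparound-aware comparison; assumes both XIDs are normal. */
static bool
sketch_xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Models the callee: it asserts that the xid is normal, as in the trace. */
static void
sketch_record_prunable(SketchPruneState *prstate, TransactionId xid)
{
    assert(TransactionIdIsNormal(xid));
    if (!TransactionIdIsValid(prstate->new_prune_xid) ||
        sketch_xid_precedes(xid, prstate->new_prune_xid))
        prstate->new_prune_xid = xid;
}

/*
 * Models the guarded call site: skip the call when the xid is not
 * normal. Returns true if the xid was recorded.
 */
static bool
sketch_record_prunable_if_normal(SketchPruneState *prstate, TransactionId xid)
{
    if (!TransactionIdIsNormal(xid))
        return false;
    sketch_record_prunable(prstate, xid);
    return true;
}
```

With the guard in place, passing xid=0 (as in frame #3 of the backtrace)
is skipped instead of tripping the assertion, while normal XIDs still
update new_prune_xid to the oldest value seen.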
