Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-12-11 04:57:10
Message-ID: 20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2021-11-22 14:34:36 -0800, Andres Freund wrote:
> I think I'm actually just not sure what the better approach for the
> backbranches is. I'm worried about the size of your patch, I'm worried about
> the craziness of the current architecture (if you can call it that) of
> heap_page_prune() with my patch.

Given the lack of further discussion, let's go with the "minimal" fix and your
stuff later. Attached is further polished patch - no significant changes,
plenty copy-editing stuff.

0001 is the bugfix

0002 adds the additional assertions - I'm wondering about only committing that
in HEAD?

I think your patch should easily be able to use prstate->htsv?

0003 removes the redundant RelationGetNumberOfBlocks() calls. As 0001 does not
change the total number of RelationGetNumberOfBlocks() calls I'm inclined to
just apply this to HEAD?

0004 is a patch that tries to address your point about the GlobalVisTestFor()
placement, sort of

My earlier point that moving it to earlier would be a bad idea because it'd
make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
any horizon determination. However moving the GlobalVisTestFor() earlier
still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
later. The attached patch moves both, wrapped in a new function, to just
before the scan in lazy_scan_heap().

This clearly is HEAD only material - I'm only bringing it up here because
the issue of the GlobalVisTestFor() placement was raised in here...

> > > I think it's worth to clean up the regression test I wrote and use it
> > > regardless of which version of the fix we end up choosing. But I'm a bit bit
> > > on the fence - it's quite complicated.
> >
> > +1 to the idea of keeping it somewhere. Without necessarily running it
> > on the BF.
> >
> > > OTOH, I think with the framework in place it'd not be too hard to write a few
> > > more tests for odd pruning scenarios...
> >
> > Can we commit a test like this without running it by default, at all?
> > It's not like it has never been done before.
>
> Is there a reason not to run it if we commit it? IME tests not run by default
> tend to bitrot very quickly. The only issue that I can think of is that it
> might be hard to make the test useful and robust in the presence of
> debug_discard_caches != 0.

I tried pretty hard to get the test to be reliable enough to commit it. But in
the end I think using the index locking as a "control" mechanism just isn't
great - as evidenced by 0003 breaking the test entirely. I think unfortunately
there's not much hope for testing this unless we get wait points - which seems
not too close :(.

Greetings,

Andres Freund

Attachment Content-Type Size
v4-0001-Fix-possible-HOT-corruption-when-RECENTLY_DEAD-ch.patch text/x-diff 8.2 KB
v4-0002-Verify-redirect-items-are-still-correct-after-hea.patch text/x-diff 3.2 KB
v4-0003-heap-pruning-Only-call-BufferGetBlockNumber-once.patch text/x-diff 1.4 KB
v4-0004-wip-vacuum-Perform-horizon-determination-just-bef.patch text/x-diff 12.2 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2021-12-11 04:58:26 Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Previous Message Vik Fearing 2021-12-11 03:14:33 Re: BUG #17321: count(*) on a 1,874,554,883 rows partitioned table takes several minutes.