Quick Links

Re: BUG #17245: Index corruption involving deduplicated entries

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Kamigishi Rei <iijima(dot)yun(at)koumakan(dot)jp>, David Rowley <dgrowley(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: BUG #17245: Index corruption involving deduplicated entries
Date:	2021-10-29 01:19:23
Message-ID:	20211029011923.utmolntkasenzreh@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

Hi,

It's not the cause of this problem, but I did find a minor issue: the retry
path in lazy_scan_prune() looses track of the deleted tuple count when
retrying.

The retry codepath also made me wonder if there could be problems if we do
FreezeMultiXactId() multiple times due to retry. I think we can end up
creating multiple multixactids for the same tuple (if the members change,
which is likely in the retry path). But that should be fine, I think.

On 2021-10-28 16:04:44 -0700, Peter Geoghegan wrote:
> > Didn't 14 change the logic when index vacuums are done? That could cause
> > previously existing issues to manifest with a higher likelihood.
>
> I don't follow. The new logic that skips index vacuuming kicks in 1)
> in an anti-wraparound vacuum emergency, and 2) when there are very few
> LP_DEAD line pointers in the heap. We can rule 1 out, I think, because
> the XIDs we see are in the low millions, and our starting point was a
> database that was upgraded via a dump and reload.

Right.

> The second criteria for skipping index vacuuming (the "less than 2% of
> heap pages have any LP_DEAD items" thing) might well have been hit on
> these tables -- it is after all very common. But I don't see how that
> could matter. We're never going to get to a code path inside
> vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array
> to LP_UNUSED (how could reached such a code path without also index
> vacuuming, given the way things are set up inside lazy_vacuum()?).
> We're always going to have the opportunity to do index vacuuming with
> any left-behind LP_DEAD line pointers in the next VACUUM -- right
> after the later VACUUM successfully returns from
> lazy_vacuum_all_indexes().

Shrug. It doesn't seem that hard to believe that repeatedly trying to prune
the same page could unearth some bugs. E.g. via the heap_prune_record_unused()
path in heap_prune_chain().

Hm. I assume somebody checked and verified that old_snapshot_threshold is not
in use? Seems unlikely, but wrongly entering that heap_prune_record_unused()
path could certainly cause issues like we're observing.

Greetings,

Andres Freund

In response to

Re: BUG #17245: Index corruption involving deduplicated entries at 2021-10-28 23:04:44 from Peter Geoghegan

Responses

Re: BUG #17245: Index corruption involving deduplicated entries at 2021-10-29 01:43:54 from Peter Geoghegan

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	PG Bug reporting form	2021-10-29 01:27:35	BUG #17253: Composite partition table configuration error
Previous Message	Thomas Munro	2021-10-28 23:57:43	Re: BUG #17245: Index corruption involving deduplicated entries