Re: BUG #17245: Index corruption involving deduplicated entries

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Kamigishi Rei <iijima(dot)yun(at)koumakan(dot)jp>, David Rowley <dgrowley(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17245: Index corruption involving deduplicated entries
Date: 2021-10-29 01:19:23
Message-ID: 20211029011923.utmolntkasenzreh@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

It's not the cause of this problem, but I did find a minor issue: the retry
path in lazy_scan_prune() loses track of the deleted tuple count when
retrying.
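
To illustrate the kind of problem I mean, here is a toy model. It is not the
vacuumlazy.c code; ToyPage, toy_prune() and both scan functions are made-up
names. It assumes the per-page count is simply reassigned from the prune call
on every pass, which is one way the count from an earlier pass can get lost:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap page; not a PostgreSQL structure. */
typedef struct ToyPage
{
    int deleted_per_pass[4];    /* tuples each successive prune pass removes */
    int passes_done;            /* prune passes performed so far */
    int retries_needed;         /* how many times the caller has to retry */
} ToyPage;

/* Hypothetical prune step: reports how many tuples this pass deleted. */
static int
toy_prune(ToyPage *page)
{
    assert(page != NULL && page->passes_done < 4);
    return page->deleted_per_pass[page->passes_done++];
}

/* Count is overwritten on retry, so earlier passes are forgotten. */
int
scan_prune_overwrite(ToyPage *page)
{
    int tuples_deleted;

retry:
    tuples_deleted = toy_prune(page);
    if (page->passes_done <= page->retries_needed)
        goto retry;
    return tuples_deleted;
}

/* Count is accumulated across retries, so nothing is lost. */
int
scan_prune_accumulate(ToyPage *page)
{
    int tuples_deleted = 0;

retry:
    tuples_deleted += toy_prune(page);
    if (page->passes_done <= page->retries_needed)
        goto retry;
    return tuples_deleted;
}
```

With a page whose first pass deletes 3 tuples and whose retry deletes 2, the
first variant reports only the retry's count while the second reports the sum.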

The retry codepath also made me wonder if there could be problems if we do
FreezeMultiXactId() multiple times due to retry. I think we can end up
creating multiple multixactids for the same tuple (if the members change,
which is likely in the retry path). But that should be fine, I think.

On 2021-10-28 16:04:44 -0700, Peter Geoghegan wrote:
> > Didn't 14 change the logic when index vacuums are done? That could cause
> > previously existing issues to manifest with a higher likelihood.
>
> I don't follow. The new logic that skips index vacuuming kicks in 1)
> in an anti-wraparound vacuum emergency, and 2) when there are very few
> LP_DEAD line pointers in the heap. We can rule 1 out, I think, because
> the XIDs we see are in the low millions, and our starting point was a
> database that was upgraded via a dump and reload.

Right.
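
For anyone following along, the two skip conditions as Peter describes them
can be sketched roughly like this. The function and parameter names are
hypothetical, the 2% figure is taken from his description quoted below, and
any further conditions the real code in vacuumlazy.c applies are deliberately
not modeled:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a PostgreSQL function. Returns true when index
 * vacuuming would be skipped under the two conditions described in this
 * thread.
 */
bool
toy_skip_index_vacuuming(bool wraparound_emergency,
                         uint64_t rel_pages,
                         uint64_t lpdead_item_pages)
{
    /* 1) anti-wraparound emergency: skip unconditionally */
    if (wraparound_emergency)
        return true;

    /*
     * 2) fewer than 2% of heap pages have any LP_DEAD items. Integer
     * arithmetic avoids floating-point rounding at the boundary.
     */
    return lpdead_item_pages * 100 < rel_pages * 2;
}
```

In this sketch a 1000-page table with 19 pages containing LP_DEAD items skips
index vacuuming, while the same table with 20 such pages does not.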

> The second criterion for skipping index vacuuming (the "less than 2% of
> heap pages have any LP_DEAD items" thing) might well have been hit on
> these tables -- it is after all very common. But I don't see how that
> could matter. We're never going to get to a code path inside
> vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array
> to LP_UNUSED (how could we reach such a code path without also index
> vacuuming, given the way things are set up inside lazy_vacuum()?).
> We're always going to have the opportunity to do index vacuuming with
> any left-behind LP_DEAD line pointers in the next VACUUM -- right
> after the later VACUUM successfully returns from
> lazy_vacuum_all_indexes().

Shrug. It doesn't seem that hard to believe that repeatedly trying to prune
the same page could unearth some bugs. E.g. via the heap_prune_record_unused()
path in heap_prune_chain().

Hm. I assume somebody checked and verified that old_snapshot_threshold is not
in use? Seems unlikely, but wrongly entering that heap_prune_record_unused()
path could certainly cause issues like we're observing.

Greetings,

Andres Freund
