Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>, Bowen Shi <zxwsbg12138(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-06-21 00:33:59
Message-ID: CAAKRu_Z50WSPWLYg-2NC4TDBSyTLMRL_jG=K+txByTAeu5nNXA@mail.gmail.com
Lists: pgsql-bugs

On Thu, Jun 20, 2024 at 11:49 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> >
> > Finally, upthread there is discussion of how we could end up doing a
> > catalog lookup after vacuum_get_cutoffs() and before the tuple
> > visibility check on 16. Assuming this is true, we would want to
> > backport the fix to 16 as well. I could use some help getting a repro
> > (using btree index deletion for example) of the infinite loop on 16.
>
> So, I ended up working on a new repro that works by forcing a round of
> index vacuuming after the standby reconnects and before pruning a dead
> tuple whose xmax is older than OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.
>
> Then vacuum's first pass will continue with pruning and find our later
> inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
> maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
>
> I make sure that the standby reconnects between vacuum_get_cutoffs()
> (vacuum_set_xid_limits() on 14/15) and pruning because I have a cursor
> on the page keeping VACUUM FREEZE from getting a cleanup lock.
>
> See the repros for step-by-step explanations of how it works.
>
> With this, I can repro the infinite loop on 14-16.
>
> Backporting 1ccc1e05ae fixes 16 but, with the new repro, 14 and 15
> error out with "cannot freeze committed xmax". I'm going to
> investigate further why this is happening. It definitely makes me
> wonder about the fix.
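
To make the disagreement described in the quoted text concrete, here is a
minimal toy model in C. It is only a sketch, not PostgreSQL source: the
names ToyXid, ToyStatus, toy_xid_precedes(), toy_classify() and
toy_horizons_disagree() are hypothetical, and the real visibility checks
do considerably more work. It assumes a tuple whose xmax is committed and
compares that xmax against a single horizon, so the same tuple can be
classified differently once maybe_needed has moved behind OldestXmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; none of this is PostgreSQL code. */
typedef uint32_t ToyXid;

typedef enum
{
    TOY_RECENTLY_DEAD,      /* deleter committed, but may still be visible */
    TOY_DEAD                /* deleter committed and precedes the horizon */
} ToyStatus;

/* Modulo-2^32 "precedes" comparison, valid for normal XIDs. */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Classify a tuple with a committed xmax against one horizon.
 * In this toy model, 0 stands for "no valid xmax".
 */
ToyStatus
toy_classify(ToyXid xmax, ToyXid horizon)
{
    assert(xmax != 0);
    return toy_xid_precedes(xmax, horizon) ? TOY_DEAD : TOY_RECENTLY_DEAD;
}

/* True when the two horizons give different answers for the same tuple. */
bool
toy_horizons_disagree(ToyXid xmax, ToyXid oldest_xmin, ToyXid maybe_needed)
{
    return toy_classify(xmax, oldest_xmin) !=
           toy_classify(xmax, maybe_needed);
}
```

With made-up values, an xmax of 95 is TOY_DEAD against an OldestXmin of
100 but TOY_RECENTLY_DEAD against a maybe_needed that has moved back to
90, so toy_horizons_disagree(95, 100, 90) returns true.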

It turns out 16 was also erroring out with "cannot freeze committed
xmax" (i.e. backporting 1ccc1e05ae did not fix anything), but I didn't
notice because the Perl TAP test passed. I also discovered that we can
hit this error on master, so I started a thread about that here [1].

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com
