Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-04-16 19:34:02
Message-ID: 20240416193402.9a.nmisch@google.com
Lists: pgsql-bugs

On Tue, Apr 16, 2024 at 11:01:08AM -0700, Andres Freund wrote:
> On 2024-04-15 20:58:25 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> > > On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > > > I have observed the infinite loop in production with v15.5, so that
> > > > non-reproduce outcome is a limitation in the test procedure. (v14.2 added
> > > > those two commits.)
> > >
> > > How closely have you analyzed those production occurrences? It's not too hard
> > > to imagine some form of corruption that leads to such a loop, but which isn't
> > > related to the horizon going backwards? E.g. a corrupted HOT chain can lead
> > > to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> > > then encounter a DEAD tuple.

I've not seen this recur for any one table, so I think we can rule out
corruption modes that would reach the loop every time. (If a hypothesized
loop explanation calls for both corruption and horizon movement, that could
still apply.)

> > One occurrence had these facts:
> >
> > HeapTupleHeaderGetXmin = 95271613
> > HeapTupleHeaderGetUpdateXid = 95280147
> > vacrel->OldestXmin = 95317451
> > vacrel->vistest->definitely_needed = 95318928
> > vacrel->vistest->maybe_needed = 93624425
> >
> > How compatible are those with the corruption vectors you have in view?
>
> Do you have more information about the page this was on? E.g. pageinspect
> output? Or at least the infomasks of that tuple?

No, unfortunately.
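
For readers following along, the XID values quoted above can be put through a rough model of the two horizon checks under discussion. This is only a sketch: the function names are hypothetical rather than PostgreSQL APIs, comparisons are plain integer comparisons (assuming no XID wraparound among these values), and it assumes the updating transaction committed, which the report does not state. Nor does it model the horizon recomputation that may happen for an XID in the uncertain range. It shows only that the update XID precedes OldestXmin while sitting between maybe_needed and definitely_needed.

```python
# Simplified model of the two horizon checks; NOT PostgreSQL source code.
# Assumptions: plain integer comparison (no XID wraparound among these
# values) and a committed updater. Function names are hypothetical.

def dead_per_oldest_xmin(dead_after, oldest_xmin):
    """OldestXmin-style check: DEAD once dead_after precedes OldestXmin."""
    return dead_after < oldest_xmin


def vistest_verdict(xid, maybe_needed, definitely_needed):
    """Three-way verdict from the maybe_needed/definitely_needed bounds."""
    if xid < maybe_needed:
        return "removable"
    if xid >= definitely_needed:
        return "needed"
    # Between the bounds the answer depends on whether recomputing the
    # horizons advances maybe_needed past xid; that is not modeled here.
    return "uncertain"


# Values reported for the one production occurrence.
update_xid = 95280147
oldest_xmin = 95317451
definitely_needed = 95318928
maybe_needed = 93624425

dead_by_oldest_xmin = dead_per_oldest_xmin(update_xid, oldest_xmin)
prune_verdict = vistest_verdict(update_xid, maybe_needed, definitely_needed)
print(dead_by_oldest_xmin, prune_verdict)
```

Under those assumptions the OldestXmin check reports the tuple as dead, while the vistest bounds alone do not settle whether it is removable.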

> I assume this was a normal
> data table (i.e. not a [shared|user] catalog table or temp table)?

Normal data table.

> Do you know what ComputeXidHorizonsResultLastXmin, RecentXmin were set to?

No.

> > I tried briefly to understand
> > https://postgr.es/m/flat/20240415173913(dot)4zyyrwaftujxthf2(at)awork3(dot)anarazel(dot)de
> > but I felt verifying its argument was going to be a big job for me. Would
> > those errors happen transiently, like the infinite loop, or would they
> > persist until something resets the tuple fields (e.g. ATRewriteTables())?
>
> I think they'd be transient, because the visibility information during the
> next vacuum would presumably not be "skewed" anymore?

That is good.

> Of course it's possible
> you'd re-encounter the problem, if you constantly have horizons going back and
> forth. But I'd still classify that as transient.

Certainly.
