From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
Date: | 2024-04-16 03:58:25 |
Message-ID: | 20240416035825.8e.nmisch@google.com |
Lists: | pgsql-bugs |
On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > > I propose to remove this open item from
> > > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > >
> > > On the original thread (BUG #17257), Alexander Lakhin says that he
> > > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> >
> > I have observed the infinite loop in production with v15.5, so that
> > non-reproduce outcome is a limitation in the test procedure. (v14.2 added
> > those two commits.)
>
> How closely have you analyzed those production occurrences? It's not too hard
> to imagine some form of corruption that leads to such a loop, but which isn't
> related to the horizon going backwards? E.g. a corrupted HOT chain can lead
> to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> then encounter a DEAD tuple.
One occurrence had these facts:

  HeapTupleHeaderGetXmin = 95271613
  HeapTupleHeaderGetUpdateXid = 95280147
  vacrel->OldestXmin = 95317451
  vacrel->vistest->definitely_needed = 95318928
  vacrel->vistest->maybe_needed = 93624425

How compatible are those with the corruption vectors you have in view?
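
One way to read these values is sketched below in Python rather than PostgreSQL's C. It compares the tuple's update XID against vacrel->OldestXmin (the kind of test lazy_scan_prune() applies via HeapTupleSatisfiesVacuum()) and against the vistest bounds (the kind of test heap_page_prune() applies). The sketch assumes the updating transaction committed, that no XID wraparound is involved so plain integer comparison is enough, and that the horizons are not recomputed when the XID falls between the two bounds. The function names are hypothetical and only model the comparisons; they are not PostgreSQL APIs.

```python
# Values reported above for one production occurrence.
XMIN = 95271613
UPDATE_XID = 95280147
OLDEST_XMIN = 95317451
DEFINITELY_NEEDED = 95318928
MAYBE_NEEDED = 93624425


def vacuum_verdict(update_xid, oldest_xmin):
    """Model of the OldestXmin test: a committed updater older than
    OldestXmin makes the old tuple version DEAD."""
    return "DEAD" if update_xid < oldest_xmin else "RECENTLY_DEAD"


def vistest_range(update_xid, maybe_needed, definitely_needed):
    """Model of the vistest bounds: below maybe_needed is removable,
    at or above definitely_needed is still needed, and anything in
    between is uncertain unless the horizons are recomputed."""
    if update_xid < maybe_needed:
        return "removable"
    if update_xid >= definitely_needed:
        return "needed"
    return "uncertain"


def verdicts_disagree(update_xid, oldest_xmin, maybe_needed, definitely_needed):
    """True when the OldestXmin test says DEAD but the vistest bounds
    do not say the tuple is removable (assumes no recomputation)."""
    dead = vacuum_verdict(update_xid, oldest_xmin) == "DEAD"
    removable = vistest_range(update_xid, maybe_needed, definitely_needed) == "removable"
    return dead and not removable


if __name__ == "__main__":
    print(vacuum_verdict(UPDATE_XID, OLDEST_XMIN))
    print(vistest_range(UPDATE_XID, MAYBE_NEEDED, DEFINITELY_NEEDED))
    print(verdicts_disagree(UPDATE_XID, OLDEST_XMIN, MAYBE_NEEDED, DEFINITELY_NEEDED))
```

With the reported values, the first comparison yields DEAD while the update XID lands between maybe_needed and definitely_needed, and maybe_needed is itself older than OldestXmin. Under the stated assumptions, that is the kind of disagreement the retry loop depends on. It does not by itself rule out corruption.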
> > > of the code, I suspect that there is a residual bug, or at least that
> > > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> >
> > Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> > the regression that 1ccc1e05ae introduced? Thanks for discovering that.
>
> Which regression has 1ccc1e05ae actually introduced? As I pointed out
> upthread, the proposed path to corruption doesn't seem to actually lead to
> corruption, "just" an error? Which actually seems considerably better than an
> endless retry loop that cannot be cancelled.
A transient, spurious error is far better than an uninterruptible infinite
loop with a buffer content lock held. If a transient error is the consistent
outcome, I would agree 1ccc1e05ae improved the situation and didn't regress
it. That would close the open item. I tried briefly to understand
https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
but I felt verifying its argument was going to be a big job for me. Would
those errors happen transiently, like the infinite loop, or would they persist
until something resets the tuple fields (e.g. ATRewriteTables())?
Thanks,
nm