Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)
Date: 2024-05-17 17:18:17
Message-ID: CAFj8pRB8KRPX_ZznE57zef6nRHz8133xHv1gcWC+xkjMBdCDcQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 17, 2024 at 18:02, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Fri, May 17, 2024 at 9:13 AM Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
> > after migrating to PostgreSQL 16 I have seen broken tables on replica
> > nodes 3 times (about once a week). The query fails with the error
> >
> > ERROR: could not access status of transaction 1442871302
> > DETAIL: Could not open file "pg_xact/0560": No such file or directory
>
> You've shown an inconsistency between the primary and standby with
> respect to the heap tuple infomask bits related to freezing. It looks
> like a FREEZE WAL record from the primary was never replayed on the
> standby.
>

I think it is possible that the broken tuples were created before the
upgrade from Postgres 15 to Postgres 16 - not long before it - so this bug
could be a side effect of the upgrade.

>
> It's natural for me to wonder if my Postgres 16 work on page-level
> freezing might be a factor here. If that really was true, then it
> would be necessary to explain why the primary and standby are
> inconsistent (no reason to suspect a problem on the primary here).
> It'd have to be the kind of issue that could be detected mechanically
> using wal_consistency_checking, but wasn't detected that way before
> now -- that seems unlikely.
>
> It's worth considering if the more aggressive behavior around
> relfrozenxid advancement (in 15) and freezing (in 16) has increased
> the likelihood of problems like these in setups that were already
> faulty, in whatever way. The standby database is indeed corrupt, but
> even on 16 it's fairly isolated corruption in practical terms. The
> full extent of the problem is clear once amcheck is run, but only one
> tuple can actually cause the system to error due to the influence of
> hint bits (for better or worse, hint bits mask the problem quite well,
> even on 16).
>
> --
> Peter Geoghegan
>
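
For reference, the file name in the error message follows directly from the
transaction ID. Below is a minimal sketch of that arithmetic in Python,
assuming the default 8 kB block size (2 status bits per transaction, so 4
transactions per byte) and 32 SLRU pages per segment. The function name
pg_xact_segment is hypothetical and used only for illustration; the constants
mirror the PostgreSQL source but are hard-coded here as assumptions.

```python
# Assumed defaults; a build with a non-default block size would differ.
BLCKSZ = 8192                 # bytes per page
CLOG_XACTS_PER_BYTE = 4       # 2 status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32   # pages per pg_xact segment file


def pg_xact_segment(xid: int) -> str:
    """Return the pg_xact segment file name that holds the status of xid."""
    xacts_per_page = BLCKSZ * CLOG_XACTS_PER_BYTE
    xacts_per_segment = xacts_per_page * SLRU_PAGES_PER_SEGMENT
    # Segment files are named with four uppercase hex digits.
    return f"{xid // xacts_per_segment:04X}"


print(pg_xact_segment(1442871302))  # prints 0560
```

So transaction 1442871302 maps to pg_xact/0560, which matches the file named
in the DETAIL line above.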
