From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month) |
Date: | 2024-05-17 20:05:15 |
Message-ID: | CAFj8pRANkt0_dYDRFHmx5un9x+A4LCo9XTmoEzySY5FzxmRq+A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
pá 17. 5. 2024 v 21:50 odesílatel Andres Freund <andres(at)anarazel(dot)de> napsal:
> Hi,
>
> On 2024-05-17 15:12:31 +0200, Pavel Stehule wrote:
> > after migration on PostgreSQL 16 I seen 3x times (about every week)
> broken
> > tables on replica nodes. The query fails with error
>
> Migrating from what version?
>
I think 14, but it should be verified tomorrow
>
>
> You're saying that the data is correctly accessible on primaries, but
> broken
> on standbys? Is there any difference in how the page looks like on the
> primary
> vs standby?
>
I saved one page from master and standby. Can I send it privately? There
are some private data (although not too sensitive)
>
>
> > ERROR: could not access status of transaction 1442871302
> > DETAIL: Could not open file "pg_xact/0560": No such file or directory
> >
> > verify_heapam reports
> >
> > ^[[Aprd=# select * from verify_heapam('account_login_history') where
> blkno
> > = 179036;
> > blkno | offnum | attnum | msg
> >
> >
> --------+--------+--------+-------------------------------------------------------------------
> > 179036 | 30 | | xmin 1393743382 precedes oldest valid
> > transaction ID 3:1687012112
>
> So that's not just a narrow race...
>
>
> > master
> >
> > (2024-05-17 14:36:57) prd=# SELECT * FROM
> > page_header(get_raw_page('account_login_history', 179036));
> > lsn │ checksum │ flags │ lower │ upper │ special │ pagesize │
> > version │ prune_xid
> >
> ───────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────
> > A576/810F4CE0 │ 0 │ 4 │ 296 │ 296 │ 8192 │ 8192 │
> > 4 │ 0
> > (1 row)
> >
> >
> > replica
> > prd_aukro=# SELECT * FROM
> page_header(get_raw_page('account_login_history',
> > 179036));
> > lsn | checksum | flags | lower | upper | special | pagesize |
> > version | prune_xid
> >
> ---------------+----------+-------+-------+-------+---------+----------+---------+-----------
> > A56C/63979DA0 | 0 | 0 | 296 | 296 | 8192 | 8192 |
> > 4 | 0
> > (1 row)
>
> Is the replica behind the primary? Or did we somehow end up with diverging
> data? The page LSNs differ by about 40GB...
>
> Is there evidence of failed truncations of the relation in the log? From
> autovacuum?
>
no I did not see these bugs,
>
> Does the data in the readable versions of the tuples on that page actually
> look valid? Is it possibly duplicated data?
>
looks well, I didn't see any strange content
>
>
> I'm basically wondering whether it's possible that we errored out during
> truncation (e.g. due to a file permission issue or such). Due to some
> brokenness in RelationTruncate() that can lead to data divergence between
> primary and standby and to old tuples re-appearing on either.
>
>
> Another question: Do you use pg_repack or such?
>
pg_repack was used 2 months before migration
>
> Greetings,
>
> Andres Freund
>
From | Date | Subject | |
---|---|---|---|
Next Message | Mark Dilger | 2024-05-17 20:08:10 | Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index. |
Previous Message | Peter Geoghegan | 2024-05-17 20:03:09 | Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month) |