From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Snapshot related assert failure on skink |
Date: | 2025-03-17 11:36:55 |
Message-ID: | f357793e-d5b0-4e56-9676-c61ddb3640b7@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 3/4/25 23:25, Andres Freund wrote:
> Hi,
>
> I just saw a BF failure on skink (valgrind) that asserts out.
>
> Check the 002_compare_backups failure in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01
>
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> postgres: pitr2: bf postgres [local] SELECT(ExceptionalCondition+0x5f) [0x6e45e0]
> postgres: pitr2: bf postgres [local] SELECT(+0x467c3d) [0x56fc3d]
> postgres: pitr2: bf postgres [local] SELECT(GetSnapshotData+0x60) [0x570fbd]
> postgres: pitr2: bf postgres [local] SELECT(GetNonHistoricCatalogSnapshot+0x53) [0x7285c5]
> postgres: pitr2: bf postgres [local] SELECT(GetCatalogSnapshot+0x28) [0x72969b]
> postgres: pitr2: bf postgres [local] SELECT(systable_beginscan+0xc9) [0x252602]
> postgres: pitr2: bf postgres [local] SELECT(+0x5c1e56) [0x6c9e56]
> postgres: pitr2: bf postgres [local] SELECT(+0x5c20e9) [0x6ca0e9]
> postgres: pitr2: bf postgres [local] SELECT(SearchCatCache+0x18) [0x6ca71c]
> postgres: pitr2: bf postgres [local] SELECT(SearchSysCache+0x21) [0x6deeae]
> postgres: pitr2: bf postgres [local] SELECT(GetSysCacheOid+0x23) [0x6df219]
> postgres: pitr2: bf postgres [local] SELECT(get_namespace_oid+0x2f) [0x2dbae0]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3c13) [0x2dbc13]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3e1b) [0x2dbe1b]
> postgres: pitr2: bf postgres [local] SELECT(+0x1d3e72) [0x2dbe72]
> postgres: pitr2: bf postgres [local] SELECT(FuncnameGetCandidates+0xaf) [0x2ddef1]
> postgres: pitr2: bf postgres [local] SELECT(func_get_detail+0x96) [0x4bf521]
> postgres: pitr2: bf postgres [local] SELECT(ParseFuncOrColumn+0x670) [0x4c13b9]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b1123) [0x4b9123]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b12ec) [0x4b92ec]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b2c60) [0x4bac60]
> postgres: pitr2: bf postgres [local] SELECT(+0x3b12dc) [0x4b92dc]
> postgres: pitr2: bf postgres [local] SELECT(transformExpr+0x20) [0x4b8faa]
> postgres: pitr2: bf postgres [local] SELECT(transformTargetEntry+0x7c) [0x4cf4f9]
> postgres: pitr2: bf postgres [local] SELECT(transformTargetList+0x96) [0x4cf5a6]
> postgres: pitr2: bf postgres [local] SELECT(+0x39a5b9) [0x4a25b9]
> postgres: pitr2: bf postgres [local] SELECT(transformStmt+0x141) [0x4a304b]
> postgres: pitr2: bf postgres [local] SELECT(+0x39c390) [0x4a4390]
> postgres: pitr2: bf postgres [local] SELECT(transformTopLevelStmt+0x19) [0x4a43cc]
> postgres: pitr2: bf postgres [local] SELECT(parse_analyze_fixedparams+0x56) [0x4a4424]
> postgres: pitr2: bf postgres [local] SELECT(pg_analyze_and_rewrite_fixedparams+0x40) [0x59b021]
> postgres: pitr2: bf postgres [local] SELECT(+0x4937f6) [0x59b7f6]
> postgres: pitr2: bf postgres [local] SELECT(PostgresMain+0x940) [0x59db54]
> postgres: pitr2: bf postgres [local] SELECT(BackendMain+0x59) [0x596b38]
> postgres: pitr2: bf postgres [local] SELECT(postmaster_child_launch+0xeb) [0x4ee282]
> postgres: pitr2: bf postgres [local] SELECT(+0x3e9271) [0x4f1271]
> postgres: pitr2: bf postgres [local] SELECT(+0x3eadb6) [0x4f2db6]
> postgres: pitr2: bf postgres [local] SELECT(PostmasterMain+0x118b) [0x4f43b2]
> postgres: pitr2: bf postgres [local] SELECT(main+0x1e2) [0x421da2]
> /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x5a27ca8]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x5a27d65]
> postgres: pitr2: bf postgres [local] SELECT(_start+0x21) [0x1eb421]
> 2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] LOG: client backend (PID 3115649) was terminated by signal 6: Aborted
> 2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] DETAIL: Failed process was running: SELECT NOT pg_is_in_recovery();
> 2025-03-04 19:11:37.943 UTC [3104796][postmaster][:0] LOG: terminating any other active server processes
>
>
> The code directly in question hasn't changed in a few years. I haven't seen
> this crash before. The only recent-ish changes in the wider area are:
>
> commit 952365cded6
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> Date: 2024-12-23 12:42:39 +0200
>
> Remove unnecessary GetTransactionSnapshot() calls
>
> commit 578a7fe7b6f8484f6d7caa2fda288abb3fe87aa0
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> Date: 2024-12-21 23:42:39 +0200
>
> Update TransactionXmin when MyProc->xmin is updated
>
> but neither is immediately yelling "here" to me.
>
>
> I don't really see how pg_combinebackup could lead to this issue either.
>
>
> One potentially interesting bit is that the statement is running *during* a
> shutdown checkpoint:
>
> 2025-03-04 19:11:37.377 UTC [3106505][checkpointer][:0] LOG: checkpoint starting: end-of-recovery immediate wait
> 2025-03-04 19:11:37.454 UTC [3115649][client backend][2/2:0] LOG: statement: SELECT NOT pg_is_in_recovery();
> TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
> ...
> 2025-03-04 19:11:37.585 UTC [3106505][checkpointer][:0] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.010 s, total=0.198 s; sync files=2, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB; lsn=0/5000060, redo lsn=0/5000060
>
>
> I wonder if the problem is related to a transaction first using a snapshot
> started in recovery and then a "normal" snapshot?
>
Hi,
Alexander Lakhin pointed me to this thread after I posted about hitting
the same assert [1].
[1]
https://www.postgresql.org/message-id/3aee3535-184d-423e-a5eb-05161e8b5617%40vondra.me
It very much seems like the same issue - the backtrace seems to match
exactly, except for the query executed ofc. The exact mechanism may be a
big different - in my stress test (restarting the primary/standby in
different ways, while running pgbench on both instances, etc.) there's
no end-of-recovery checkpoint, for example.
In my tests, 100% of the failures hit the assert immediately after these
messages in the server log (all failures are on the standby)
LOG: database system is ready to accept read-only connections
LOG: started streaming WAL from primary at 2/86000000 on timeline 1
And by "immediately" I mean it's the very next thing in the log. It
seems something happens (to the RecentXmin?) when opening the standby
for read-only queries that breaks an existing snapshot ... I'm not sure
if the end-of-recovery checkpoint does something similar.
I'm still fiddling with the script, trying to increase the probability
of the (apparent) race condition. On one machine (old Xeon) I can hit it
very easily/reliably, while on a different machine (new Ryzen) it's very
rare. I don't know if that's due to difference in speed of the CPU, or
fewer cores, ... I guess it changes the timing just enough.
I've also tried running the stress test on PG17, and I'm yet to see a
single failure there. Not even on the xeon machine, that hits it
reliably on 18. So this seems to be a PG18-only issue.
If needed, I can try adding more logging, or test a patch.
regards
--
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Junwang Zhao | 2025-03-17 11:42:38 | Re: general purpose array_sort |
Previous Message | Nazir Bilal Yavuz | 2025-03-17 11:36:46 | Re: [PoC] Federated Authn/z with OAUTHBEARER |