From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date: | 2022-09-04 11:04:34 |
Message-ID: | a89b46b6-0239-2fd5-71a9-b19b1f7a7145@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.
The crash always happens in the "ddl" test. The backtrace always looks
like this:
(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]
I'm unable to get a better backtrace from the valgrind-produces core
usign gdb, for some reason.
However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:
WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================
The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.
But the very first WAL record for 849 is
ASSIGNMENT xtop 848: subxacts: 849
so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.
Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.
Any ideas?
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment | Content-Type | Size |
---|---|---|
wal-crash.log.gz | application/gzip | 6.5 KB |
wal-ok.log.gz | application/gzip | 6.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2022-09-04 11:38:55 | Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP. |
Previous Message | Zhihong Yu | 2022-09-04 10:58:06 | Re: freeing LDAPMessage in CheckLDAPAuth |