Quick Links

TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date:	2022-09-04 11:04:34
Message-ID:	a89b46b6-0239-2fd5-71a9-b19b1f7a7145@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produces core
usign gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.

But the very first WAL record for 849 is

ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.

Any ideas?

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
wal-crash.log.gz	application/gzip	6.5 KB
wal-ok.log.gz	application/gzip	6.5 KB

Responses

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) at 2022-09-04 11:49:29 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2022-09-04 11:38:55	Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP.
Previous Message	Zhihong Yu	2022-09-04 10:58:06	Re: freeing LDAPMessage in CheckLDAPAuth