Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date: 2022-09-04 11:49:29
Message-ID: CAA4eK1+NNjJMnvRtee_e7pxSv2vaK2=iT-dbJPG1kNN+ELm7Nw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
> in test_decoding ddl test in ~50% runs. I don't see the same failure
> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
> I suspect this is a race condition, and with valgrind the timing changes
> in a way to make it more likely.
>
> The crash always happens in the "ddl" test. The backtrace always looks
> like this:
>
> (ExceptionalCondition+0x98)[0x8f6f7c]
> (+0x57a7ec)[0x6827ec]
> (+0x579edc)[0x681edc]
> (ReorderBufferAddNewTupleCids+0x60)[0x686758]
> (SnapBuildProcessNewCid+0x94)[0x68b920]
> (heap2_decode+0x17c)[0x671584]
> (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
> (+0x570f88)[0x678f88]
> (pg_logical_slot_get_changes+0x1c)[0x6790fc]
> (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
> (+0x3be638)[0x4c6638]
> (+0x3a2c14)[0x4aac14]
> (ExecScan+0x8c)[0x4aaca8]
> (+0x3bea14)[0x4c6a14]
> (+0x39ea60)[0x4a6a60]
> (+0x392378)[0x49a378]
> (+0x39520c)[0x49d20c]
> (standard_ExecutorRun+0x214)[0x49aad8]
> (ExecutorRun+0x64)[0x49a8b8]
> (+0x62f53c)[0x73753c]
> (PortalRun+0x27c)[0x737198]
> (+0x627e78)[0x72fe78]
> (PostgresMain+0x9a0)[0x73512c]
> (+0x547be8)[0x64fbe8]
> (+0x547540)[0x64f540]
> (+0x542d30)[0x64ad30]
> (PostmasterMain+0x1460)[0x64a574]
> (+0x418888)[0x520888]
>
> I'm unable to get a better backtrace from the valgrind-produces core
> usign gdb, for some reason.
>
> However, I've modified AssertTXNLsnOrder() - which is where the assert
> is checked - to also dump toplevel_by_lsn instead of just triggering the
> assert, and the result is always like this:
>
> WARNING: ==============================================
> WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
> WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
> WARNING: ==============================================
>
> The LSNs change a bit between the runs, but the failing transactions are
> always 848 and 849. Also, both transactions have exactly the same info.
>
> But the very first WAL record for 849 is
>
> ASSIGNMENT xtop 848: subxacts: 849
>
> so it's strange 849 is in the toplevel_by_lsn list at all, because it
> clearly is a subxact of 848.
>

There is no guarantee that toplevel_by_lsn won't have subxact. As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

[1] - https://www.postgresql.org/message-id/CAA4eK1LK1nxOTL32OP%3DejhPoBsUP4Bvwb3Ly%3DfethyJ-KbaXyw%40mail.gmail.com

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2022-09-04 12:24:15 Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Previous Message Amit Kapila 2022-09-04 11:38:55 Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP.