Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date: 2023-06-06 15:42:05
Message-ID: bedc421a-3cdd-ca5a-7b9e-bc6da989f500@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/6/23 14:00, Alexander Lakhin wrote:
> Hello Tomas,
>
> 06.06.2023 12:56, Tomas Vondra wrote:
>> On 6/6/23 11:00, Alexander Lakhin wrote:
>>> Hello,
>>> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
>>> ctx->snapshot_builder->start_decoding_at: 209807224,
>>> ctx->reader->EndRecPtr: 210043072,
>>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>>> and inside the loop:
>>> cur_txn->first_lsn: 209792872
>>> cur_txn->first_lsn: 209975744
>>> cur_txn->first_lsn: 210043008
>>> cur_txn->first_lsn: 210043008
>>> and it triggers the Assert.
>>>
>> So what's the prev_first_lsn value for these first_lsn values? How does
>> it change over time? Did you try looking at the pg_waldump for these
>> positions?
>
> With more logging I've got (for another run):
> ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80
>
> ctx->snapshot_builder->start_decoding_at: c1f2cc0,
> ctx->reader->EndRecPtr: c1fbcc0,
> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
> prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
> prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
> TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...
>
> waldump for 00000001000000000000000C shows:
> grep c1fbc80:
> rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn:
> 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
> 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
> rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn:
> 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
> blkref #0: rel 1663/18763/19987 blk 0
>
> grep '( 3397| 3398)'

I've been able to reproduce this, after messing with the script a little
bit (I had to skip the test_decoding regression tests, because that was
complaining about slots already existing etc).

Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):

26662 0/6462E6F0 (first 0/0)
26661 0/6462E6F0 (first 0/6462E6F0)

where 26661 is the top xact, 26662 is a subxact of 26661. This is
clearly a problem, because we really should not have subxact in this
list once the assignment gets applied.

And the relevant WAL looks like this:

---------------------------------------------------------------------
26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
subxacts: 26662
26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
...
0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
26661 26664; 3 subxacts: 26668 26662 26665
...
26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
16:41:24.442870 CEST; subxacts: 26662
---------------------------------------------------------------------

so the assignment is the *first* thing that happens for these xacts.

However, we skip the assignment, because the log for this call of
get_changes says this:

LOG: logical decoding found consistent point at 0/6462E5D8

so we fail to realize the 26662 is a subxact.

Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
does this:

ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
xlrec->target_locator, xlrec->target_tid,
xlrec->cmin, xlrec->cmax,
xlrec->combocid);

and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL
record. But ReorderBufferXidSetCatalogChanges() already did the same
thing for the subxact 26662, as it has no idea it's a subxact (due to
the skipped assignment).

I haven't figured out what exactly is happening / what it should be
doing instead. But it seems wrong to skip the assignment - I wonder if
SnapBuildProcessRunningXacts might be doing that too eagerly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2023-06-06 15:46:48 Re: Let's make PostgreSQL multi-threaded
Previous Message Rômulo Coutinho 2023-06-06 14:42:36 ERROR: could not determine which collation to use for string comparison