From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date: | 2023-06-06 15:42:05 |
Message-ID: | bedc421a-3cdd-ca5a-7b9e-bc6da989f500@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 6/6/23 14:00, Alexander Lakhin wrote:
> Hello Tomas,
>
> 06.06.2023 12:56, Tomas Vondra wrote:
>> On 6/6/23 11:00, Alexander Lakhin wrote:
>>> Hello,
>>> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
>>> ctx->snapshot_builder->start_decoding_at: 209807224,
>>> ctx->reader->EndRecPtr: 210043072,
>>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>>> and inside the loop:
>>> cur_txn->first_lsn: 209792872
>>> cur_txn->first_lsn: 209975744
>>> cur_txn->first_lsn: 210043008
>>> cur_txn->first_lsn: 210043008
>>> and it triggers the Assert.
>>>
>> So what's the prev_first_lsn value for these first_lsn values? How does
>> it change over time? Did you try looking at the pg_waldump for these
>> positions?
>
> With more logging I've got (for another run):
> ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80
>
> ctx->snapshot_builder->start_decoding_at: c1f2cc0,
> ctx->reader->EndRecPtr: c1fbcc0,
> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
> prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
> prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
> TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...
>
> waldump for 00000001000000000000000C shows:
> grep c1fbc80:
> rmgr: Heap2 len (rec/tot): 60/ 60, tx: 3398, lsn:
> 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
> 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
> rmgr: Heap len (rec/tot): 59/ 59, tx: 3398, lsn:
> 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
> blkref #0: rel 1663/18763/19987 blk 0
>
> grep '( 3397| 3398)'
I've been able to reproduce this, after messing with the script a little
bit (I had to skip the test_decoding regression tests, because that was
complaining about slots already existing etc).
Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):
26662 0/6462E6F0 (first 0/0)
26661 0/6462E6F0 (first 0/6462E6F0)
where 26661 is the top xact, 26662 is a subxact of 26661. This is
clearly a problem, because we really should not have subxact in this
list once the assignment gets applied.
And the relevant WAL looks like this:
---------------------------------------------------------------------
26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
subxacts: 26662
26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
...
0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
26661 26664; 3 subxacts: 26668 26662 26665
...
26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
16:41:24.442870 CEST; subxacts: 26662
---------------------------------------------------------------------
so the assignment is the *first* thing that happens for these xacts.
However, we skip the assignment, because the log for this call of
get_changes says this:
LOG: logical decoding found consistent point at 0/6462E5D8
so we fail to realize the 26662 is a subxact.
Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
does this:
ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
xlrec->target_locator, xlrec->target_tid,
xlrec->cmin, xlrec->cmax,
xlrec->combocid);
and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL
record. But ReorderBufferXidSetCatalogChanges() already did the same
thing for the subxact 26662, as it has no idea it's a subxact (due to
the skipped assignment).
I haven't figured out what exactly is happening / what it should be
doing instead. But it seems wrong to skip the assignment - I wonder if
SnapBuildProcessRunningXacts might be doing that too eagerly.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Heikki Linnakangas | 2023-06-06 15:46:48 | Re: Let's make PostgreSQL multi-threaded |
Previous Message | Rômulo Coutinho | 2023-06-06 14:42:36 | ERROR: could not determine which collation to use for string comparison |