test_decoding assertion failure for the loss of top-sub transaction relationship

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: test_decoding assertion failure for the loss of top-sub transaction relationship
Date: 2022-09-02 00:56:43
Message-ID: TYCPR01MB83733C6CEAE47D0280814D5AED7A9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi, hackers

I hit an assertion failure in logical decoding with the scenario below on HEAD.

---
<preparation>
create table tab1 (val integer);
select 'init' from pg_create_logical_replication_slot('regression_slot', 'test_decoding');

<session1>
begin;
savepoint sp1;
insert into tab1 values (1);

<session2>
checkpoint; -- for RUNNING_XACT
select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

<session1>
truncate tab1; -- for NEW_CID
commit;
begin;
insert into tab1 values (3);

<session2>
checkpoint;
select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

<session1>
commit;

<session2>

select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
---

It is not required, but it helps to increase LOG_SNAPSHOT_INTERVAL_MS so that
RUNNING_XACT records are written only in response to our explicit checkpoint commands.

In the above scenario, the first checkpoint generates RUNNING_XACT after the WAL record
(the one handled by ReorderBufferAssignChild) that associates the subtransaction with its
top-level transaction. This means that once decoding restarts from that RUNNING_XACT, the
association between the top-level transaction and the subtransaction is lost. If the
subtransaction then changes the catalog, decoding NEW_CID (written after RUNNING_XACT)
cannot mark the top-level transaction as a catalog-modifying transaction.

As a result, the assertion fails. It checks the consistency rule that
when a subtransaction modifies the catalog, its top-level transaction
must be marked as catalog-modifying as well.
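
To make the failure mode concrete, here is a toy model. It is only a sketch and not
PostgreSQL code: ToyTxn, toy_mark_catalog_change and toy_commit_invariant_holds are
hypothetical names, and the model assumes only that a catalog-change mark propagates
from a subtransaction to its top-level transaction when the association is known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a decoded transaction; NOT PostgreSQL's real struct. */
typedef struct ToyTxn
{
    unsigned int   xid;
    struct ToyTxn *toplevel;            /* NULL: top-level, or association unknown */
    bool           has_catalog_changes;
} ToyTxn;

/*
 * Mark a transaction as catalog-modifying (roughly what decoding NEW_CID
 * leads to) and propagate the mark to the top-level transaction, but only
 * if the association is known.
 */
static void
toy_mark_catalog_change(ToyTxn *txn)
{
    txn->has_catalog_changes = true;
    if (txn->toplevel != NULL)
        txn->toplevel->has_catalog_changes = true;
}

/*
 * Simplified form of the condition "!needs_snapshot || needs_timetravel".
 * Assumption of this toy: needs_snapshot is driven by the subtransactions
 * and needs_timetravel by the mark on the top-level transaction.
 */
static bool
toy_commit_invariant_holds(const ToyTxn *top, const ToyTxn *subs, size_t nsubs)
{
    bool needs_snapshot = false;
    bool needs_timetravel = top->has_catalog_changes;

    for (size_t i = 0; i < nsubs; i++)
    {
        if (subs[i].has_catalog_changes)
            needs_snapshot = true;
    }
    return !needs_snapshot || needs_timetravel;
}
```

With sub.toplevel pointing at the top-level transaction, marking the subtransaction
keeps the condition true. With sub.toplevel left NULL (the association was lost at
restart), the same mark makes the condition false, which corresponds to the assertion
failure reported here.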

I think that, when decoding RUNNING_XACT, we need to record the relationship between the
top-level transaction and the subtransaction in the serialized snapshot, even before any
catalog change, so that the association can still be tracked after the restart. What do you think?
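
As a rough illustration of this idea only: the sketch below stores (subxid, topxid)
pairs in a byte buffer and reads them back. Everything in it is hypothetical (ToyAssoc,
toy_serialize, toy_restore, toy_lookup_top, and the buffer layout); it is not the on-disk
format of serialized snapshots and not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one subtransaction and its top-level transaction. */
typedef struct ToyAssoc
{
    unsigned int subxid;
    unsigned int topxid;
} ToyAssoc;

/* Write a count followed by n pairs into buf; returns bytes used, 0 if buf is too small. */
static size_t
toy_serialize(const ToyAssoc *assocs, size_t n, unsigned char *buf, size_t cap)
{
    size_t need = sizeof(size_t) + n * sizeof(ToyAssoc);

    if (need > cap)
        return 0;
    memcpy(buf, &n, sizeof(size_t));
    memcpy(buf + sizeof(size_t), assocs, n * sizeof(ToyAssoc));
    return need;
}

/* Read the pairs back into out (at most max entries); returns the number restored. */
static size_t
toy_restore(const unsigned char *buf, size_t len, ToyAssoc *out, size_t max)
{
    size_t n;

    if (len < sizeof(size_t))
        return 0;
    memcpy(&n, buf, sizeof(size_t));
    if (n > max || len < sizeof(size_t) + n * sizeof(ToyAssoc))
        return 0;
    memcpy(out, buf + sizeof(size_t), n * sizeof(ToyAssoc));
    return n;
}

/* Return the top-level xid recorded for subxid, or 0 if none is recorded. */
static unsigned int
toy_lookup_top(const ToyAssoc *assocs, size_t n, unsigned int subxid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (assocs[i].subxid == subxid)
            return assocs[i].topxid;
    }
    return 0;
}
```

After a restart, a lookup like toy_lookup_top on the restored pairs would let decoding
find the top-level transaction for a subtransaction even though the original assignment
record is before the restart point.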

The stack trace of this failure and related information are below.

(gdb) bt
#0 0x00007f2632588387 in raise () from /lib64/libc.so.6
#1 0x00007f2632589a78 in abort () from /lib64/libc.so.6
#2 0x0000000000b3eba1 in ExceptionalCondition (conditionName=0xd137e0 "!needs_snapshot || needs_timetravel",
errorType=0xd130c5 "FailedAssertion", fileName=0xd130b9 "snapbuild.c", lineNumber=1116) at assert.c:69
#3 0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1,
subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116
#4 0x00000000008fa420 in DecodeCommit (ctx=0x23e0108, buf=0x7fff4a1f9220, parsed=0x7fff4a1f9020, xid=728,
two_phase=false) at decode.c:630
#5 0x00000000008f9953 in xact_decode (ctx=0x23e0108, buf=0x7fff4a1f9220) at decode.c:216
#6 0x00000000008f967d in LogicalDecodingProcessRecord (ctx=0x23e0108, record=0x23e04a0) at decode.c:119
#7 0x0000000000900b63 in pg_logical_slot_get_changes_guts (fcinfo=0x23d80a8, confirm=true, binary=false)
at logicalfuncs.c:271
#8 0x0000000000900ca0 in pg_logical_slot_get_changes (fcinfo=0x23d80a8) at logicalfuncs.c:338
...
(gdb) frame 3
#3 0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1,
subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116
1116 Assert(!needs_snapshot || needs_timetravel);
(gdb) list
1111 {
1112 /* record that we cannot export a general snapshot anymore */
1113 builder->committed.includes_all_transactions = false;
1114 }
1115
1116 Assert(!needs_snapshot || needs_timetravel);
1117
1118 /*
1119 * Adjust xmax of the snapshot builder, we only do that for committed,
1120 * catalog modifying, transactions, everything else isn't interesting for

Best Regards,
Takamichi Osumi
