Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

From: ocean_li_996 <ocean_li_996(at)163(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: "'Alexander Lakhin'" <exclusion(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "feichanghong(at)qq(dot)com" <feichanghong(at)qq(dot)com>, amit(dot)kapila16(at)gmail(dot)com
Subject: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
Date: 2024-03-11 16:43:01
Message-ID: 5f2487c9.13435.18e2e656942.Coremail.ocean_li_996@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dear Hayato:

At 2024-03-08 15:35:16, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>Dear Haiyang, Alexander,
>
>Below part describes the second failure. The issue would be occurred when:
>
>1) Logical decoding starts from the middle of a sub-transaction, and
>2) NEW_CID record is initially decoded in the sub-transaction, and
>3) An arbitrary changes are decoded in the sub-transaction.

>...

## Issue 1
Thanks for your reproducer. I have investigated this issue. The scenario that caused the issue
is indeed as you described above. I had not realized before that a serialized snapshot file from
one replication slot could be utilized by another replication slot to achieve a consistent state.
This issue would not arise with only one replication slot, as SnapBuildXactNeedsSkip ensures that.

Using the spec test case you provided (without any fix patch), I got another Stuck trace on REL_14_STABLE :
```
#0 0x00007f93604e0277 in __GI_raise (sig=sig(at)entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f93604e1968 in __GI_abort () at abort.c:90
#2 0x0000000000b084da in ExceptionalCondition (conditionName=0xcc94c8 "prev_first_lsn < cur_txn->first_lsn", errorType=0xcc92c4 "FailedAssertion", fileName=0xcc9310 "reorderbuffer.c",
lineNumber=916) at assert.c:69
#3 0x00000000008df064 in AssertTXNLsnOrder (rb=0x2a48590) at reorderbuffer.c:916
#4 0x00000000008de9a1 in ReorderBufferTXNByXid (rb=0x2a48590, xid=756, create=true, is_new=0x0, lsn=24749416, create_as_top=true) at reorderbuffer.c:669
#5 0x00000000008e2d00 in ReorderBufferAddNewTupleCids (rb=0x2a48590, xid=756, lsn=24749416, node=..., tid=..., cmin=1, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3198
#6 0x00000000008e7ed7 in SnapBuildProcessNewCid (builder=0x2a44570, xid=757, lsn=24749416, xlrec=0x2a16d38) at snapbuild.c:823
#7 0x00000000008d14b3 in DecodeHeap2Op (ctx=0x2a324f0, buf=0x7ffd886a4bc0) at decode.c:471
#8 0x00000000008d0bfa in LogicalDecodingProcessRecord (ctx=0x2a324f0, record=0x2a328f0) at decode.c:151
#9 0x00000000008d8688 in pg_logical_slot_get_changes_guts (fcinfo=0x2a206b0, confirm=true, binary=false) at logicalfuncs.c:296
#10 0x00000000008d87bd in pg_logical_slot_get_changes (fcinfo=0x2a206b0) at logicalfuncs.c:365
```

Then, I think the simplest process would be:
```
ReorderBufferProcessXid(xid = 757, lsn)
-> A ReorderBufferTXN for subtransaciton (757) was generated.
The first_lsn was the head of NEW_CID record.
DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
SnapBuildProcessNewCid()
ReorderBufferAddNewTupleCids(..., xlrec->top_xid = 756,...)
ReorderBufferTXNByXid(xid = 756, lsn)
-> A ReorderBufferTXN for top-transaciton (756) was generated.
The first_lsn was the head of NEW_CID record. It caused an Assertion failure.
```
> I think the straightforward fix is to associate them to top-sub relationship,
> and attached patch did it.
LGFM, I think it is suitable assign subtransaction after calling ReorderBufferAddNewTupleCids().

## Issue 2
Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

The approach in [3] is also LGFM.

[1] https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0@postgresql.org
[2] https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com
[3] https://www.postgresql.org/message-id/CAA4eK1KpW5pHMwMp9hfXYvOeEU5Rcbhoc7FxtBOGPgKeyYLDmA@mail.gmail.com

Best Regards,
Haiyang Li

Attachment Content-Type Size
v2-0001-TestCase-Coredump-On-Assert_TXNLsnOrder.patch application/octet-stream 3.6 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2024-03-11 20:10:08 BUG #18387: Erroneous permission checks and/or misleading error messages with refresh materialized view
Previous Message PG Bug reporting form 2024-03-11 12:48:39 BUG #18386: Slow connection access after 'vacuum full pg_attribute'