RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'ocean_li_996' <ocean_li_996(at)163(dot)com>
Cc: 'Alexander Lakhin' <exclusion(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "feichanghong(at)qq(dot)com" <feichanghong(at)qq(dot)com>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>
Subject: RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
Date: 2024-03-12 10:22:59
Message-ID: TYCPR01MB12077369E4B9B34979378F435F52B2@TYCPR01MB12077.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Dear Haiyang,

Thanks for checking! This reply still focuses only on "Issue 2" in your notation.

>## Issue 2
>Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
>is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

Good finding. I also confirmed that the workload fails after reverting 6b77048e5,
and that the patch [1] fixes the workload as well.

permutation "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1" "s0_savepoint_release"
"s2_init" "s1_checkpoint" "s1_get_changes" "s0_commit" "s2_get_changes"

## Analysis

The point is that a snapshot serialized by another replication slot can be reused.
When the first get_changes is called, a consistent snapshot can be serialized because
of the XLOG_RUNNING_XACTS record (see later).
The get_changes for the second slot reuses it, so it can read WAL records properly.
(If the first slot did not exist, the state of the snapshot would be
SNAPBUILD_BUILDING_SNAPSHOT, so no records would be read.)
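
To make the reuse concrete, here is a toy Python model of the behavior described
above. It is only a sketch, not the snapbuild.c logic: the function names and the
set of serialized LSNs are hypothetical, the LSN used is just one of the
RUNNING_XACTS positions from the log in this mail, and only the two outcomes
mentioned above are modeled.

```python
from enum import Enum


class SnapBuildState(Enum):
    BUILDING_SNAPSHOT = "SNAPBUILD_BUILDING_SNAPSHOT"
    CONSISTENT = "SNAPBUILD_CONSISTENT"


def state_at_running_xacts(record_lsn, serialized_lsns):
    """Toy model: state of a slot's snapshot builder at a RUNNING_XACTS record.

    serialized_lsns is a hypothetical set of LSNs at which some slot has
    already serialized a consistent snapshot.
    """
    if record_lsn in serialized_lsns:
        # A snapshot serialized by another slot can be reused.
        return SnapBuildState.CONSISTENT
    # Otherwise the builder still has to wait for running transactions.
    return SnapBuildState.BUILDING_SNAPSHOT


def decodes_changes(state):
    """In this toy model, records are only read once the snapshot is consistent."""
    return state is SnapBuildState.CONSISTENT


# Illustrative LSN only: which record the first slot serialized at is an assumption.
RUNNING_XACTS_LSN = 0x01906E68

with_first_slot = state_at_running_xacts(RUNNING_XACTS_LSN, {RUNNING_XACTS_LSN})
without_first_slot = state_at_running_xacts(RUNNING_XACTS_LSN, set())
```

With the serialized snapshot available, the second slot is immediately consistent
and goes on to read the later records; without it, the slot stays in
SNAPBUILD_BUILDING_SNAPSHOT and reads nothing.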

In the second get_changes, the records below are read. The first (LOCK, RUNNING_XACTS)
pair is generated by the slot creation, and the second pair comes from the
CHECKPOINT. I.e., it reads all records since the slot creation.

```
...lsn: 0/01906DB8, prev 0/01906D58, desc: LOCK ...
...lsn: 0/01906DF0, prev 0/01906DB8, desc: RUNNING_XACTS ...
...lsn: 0/01906E30, prev 0/01906DF0, desc: LOCK ...
...lsn: 0/01906E68, prev 0/01906E30, desc: RUNNING_XACTS ...
...lsn: 0/01906EA8, prev 0/01906E68, desc: CHECKPOINT_ONLINE ...
...lsn: 0/01906F20, prev 0/01906EA8, desc: COMMIT ... subxacts: 728; ... inval msgs: ...
```

Also, the final COMMIT record contains the info for a subtransaction and the
XACT_XINFO_HAS_INVALS flag, so DecodeCommit()->SnapBuildXidSetCatalogChanges()
is called for the transactions.

As a result, two ReorderBufferTXNs are created with the same LSN, which fails the
assertion in AssertTXNLsnOrder().
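
As a rough illustration of why the same LSN is a problem, below is a simplified
Python model of the ordering check. It is a sketch based on my understanding that
AssertTXNLsnOrder() requires the first LSNs of top-level transactions to be
strictly increasing; the Txn class, the function name, and the top-level xid 727
are hypothetical (only the subxact 728 and the COMMIT LSN come from the log above).

```python
from dataclasses import dataclass


@dataclass
class Txn:
    xid: int
    first_lsn: int


def first_lsn_order_violation(toplevel_by_lsn):
    """Return the first adjacent pair whose first_lsn is not strictly
    increasing, or None if the whole list is ordered."""
    prev = None
    for cur in toplevel_by_lsn:
        if prev is not None and not (prev.first_lsn < cur.first_lsn):
            return (prev, cur)
        prev = cur
    return None


COMMIT_LSN = 0x01906F20  # lsn of the COMMIT record in the log above

# Both entries get created while decoding the same COMMIT record.
# xid 727 is a made-up top-level xid; 728 is the subxact from the record.
txns = [Txn(727, COMMIT_LSN), Txn(728, COMMIT_LSN)]
violation = first_lsn_order_violation(txns)
```

Here violation is the (727, 728) pair, because equal LSNs are not strictly
increasing; a list such as [Txn(1, 10), Txn(2, 20)] passes the check.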

I will update the patch if the above analysis is correct.

>The approach in [3] is also LGFM.

Thanks. I agree that we should avoid relaxing the Assert() condition as much as possible.

[1]: https://www.postgresql.org/message-id/TYCPR01MB1207790E98F0A563280CC39FCF5262%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/
