RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Callahan, Drew" <callaan(at)amazon(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Potential data loss due to race condition during logical replication slot creation
Date: 2024-03-27 07:54:12
Message-ID: TYCPR01MB12077ADB13450B422B6161A01F5342@TYCPR01MB12077.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dear Sawada-san,

Thanks for giving comments and sorry for late reply.

> > If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
>
> I think that builder->initial_xmin_horizon could be older than
> highest_running_xid, for example, when there is a logical replication
> slot whose catalog_xmin is old. However, even in this case, we might
> need to ignore restoring the snapshot. For example, a slightly
> modified test case still can cause the same problem.
>
> The test case in the Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0"\ "s1_get_changes_slot1"
>
> Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0\ " "s1_get_changes_slot1"

Good catch, I confirmed that the variant still partially output transactions:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.tbl: INSERT: val1[integer]:2
COMMIT
(3 rows)
```

/////

I analyzed a bit. In below descriptions, I uses these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations.
slot1 is created in parallel

While creating slot0, initial_xmin_horizon was also set, and it come from
GetOldestSafeDecodingTransactionId().
Since there were no slots in the database before, the value was just a next xid.

Then, the pinned xid was assigned to the txn0.
It is correct behavior because slot0 must decode txn0.

Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was
restored once. txn1 was the recorded as the highest_running, but initial_xmin was
older than it. So the snapshot was restored and txn1 was partially decoded.

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix? One idea is based on (c), and adds a flag which indicates
an existence of concurrent transactions. The restoration is skipped if both two
flags are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment Content-Type Size
0001-Serialize-a-flag-and-use-finding_start_point.patch application/octet-stream 15.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2024-03-27 08:25:39 BUG #18409: After my windows update, I can not run postgre 16 server
Previous Message Andrew Dunstan 2024-03-27 07:42:21 Re: BUG #18379: LDAP bind password exposed