RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Callahan, Drew" <callaan(at)amazon(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Potential data loss due to race condition during logical replication slot creation
Date: 2024-07-08 09:01:47
Message-ID: OSBPR01MB25524873AD000CBAD32F4D63F5DA2@OSBPR01MB2552.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dear Sawada-san,

Thanks for creating the patch!

> I've attached updated patches for HEAD and pg17 for now (I will create
> the patch for other backbranches).
>
> In the patches, I used a different approach in between HEAD and
> backbranches. On HEAD, we store a flag indicating whether or not we
> should skip snapshot restores into the SnapBuild struct and set it
> only while finding the start point. Therefore we have to bump
> SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> above; store the flag in LogicalDecodingContext and set it when
> creating the LogicalDecodingContext for a new logical slot. A possible
> downside of the approach taken for backbranches is that we implicitly
> require for users to use the same LogicalDecodingContext for both
> initializing the context for a new slot and finding its start point.
> IIUC it was not strictly required. This restriction would not be a
> problem at least in the core, but I'm not sure if there are no
> external extensions that create a logical slot in that way. This is
> the main reason why I used a different approach on HEAD and
> backbranches. Therefore, if it does not matter, I think we can use the
> same approach on all branches, which is better in terms of
> maintainability.

I want to confirm your point. You meant that you wanted to unify appraoches,
especially you wanted to store the flag in LogicalDecodingContext, rigth? I
briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
find such pattern. E.g., [1]. But this point must be pointed by extension developers.

>
> Also, I excluded the test case for the problem that Kuroda-san
> reported[1] since the reported problem occurred due to the same cause
> of the problem originally reported on this thread. The bug that
> decodes only partial transactions could lead to various symptoms. IIUC
> these test cases test the same behavior.

Yes, I also think they are caused by the same root cause, so we can skip.

Comments for HEAD patch:
1.
You must modify test_decoding/meson.build. It was also missing in patches for
backbranches.

2.
The test code needs cleanup. E.g.,
- Steps "s0_insert_cat" and "s0_savepoint" is not used
- Table user_cat is not used
It was also missing in patches for backbranches.

3.
Just to confirm - In SnapshotRestore(), the added attribute is not restored
from the disk. This is intentional because 1) it has been set to false when it
was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
by palloc0() so it was set to false. Is it right?

Comments for backblanches patch:

1.
According to wikipage [2], new attribute must be at the end of struct.

[1]: https://github.com/cybertec-postgresql/pg_squeeze/blob/master/worker.c#L1614
[2]: https://wiki.postgresql.org/wiki/Committing_checklist

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sandeep Thakkar 2024-07-08 11:35:06 Re: Error when installing PostgreSQL 16.3.2 under system context NT AUTHORITY\SYSTEM
Previous Message Muhammad Ikram 2024-07-08 06:13:19 Re: Error when installing PostgreSQL 16.3.2 under system context NT AUTHORITY\SYSTEM