Re: Potential data loss due to race condition during logical replication slot creation

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Callahan, Drew" <callaan(at)amazon(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Potential data loss due to race condition during logical replication slot creation
Date: 2024-07-09 05:29:52
Message-ID: CAD21AoBV7odVHcTsXS2VNUKqw3J0AB_jaRNhL0UuXJ7=kOkLFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Sawada-san,
> >
> > Thanks for creating the patch!
> >
> > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > the patch for other backbranches).
> > >
> > > In the patches, I used a different approach in between HEAD and
> > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > should skip snapshot restores into the SnapBuild struct and set it
> > > only while finding the start point. Therefore we have to bump
> > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > above; store the flag in LogicalDecodingContext and set it when
> > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > downside of the approach taken for backbranches is that we implicitly
> > > require for users to use the same LogicalDecodingContext for both
> > > initializing the context for a new slot and finding its start point.
> > > IIUC it was not strictly required. This restriction would not be a
> > > problem at least in the core, but I'm not sure if there are no
> > > external extensions that create a logical slot in that way. This is
> > > the main reason why I used a different approach on HEAD and
> > > backbranches. Therefore, if it does not matter, I think we can use the
> > > same approach on all branches, which is better in terms of
> > > maintainability.
> >
> > I want to confirm your point. You meant that you wanted to unify appraoches,
> > especially you wanted to store the flag in LogicalDecodingContext, rigth?
>
> Yes. Ideally I'd like to use the same approach in all branches
> regardless of how to fix it for better maintainability.
>
> > I
> > briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
> > find such pattern. E.g., [1]. But this point must be pointed by extension developers.
>
> Thank you for searching on github. It will also affect the future
> extension developments so I'm slightly hesitant to do that.
>
> > >
> > > Also, I excluded the test case for the problem that Kuroda-san
> > > reported[1] since the reported problem occurred due to the same cause
> > > of the problem originally reported on this thread. The bug that
> > > decodes only partial transactions could lead to various symptoms. IIUC
> > > these test cases test the same behavior.
> >
> > Yes, I also think they are caused by the same root cause, so we can skip.
> >
> > Comments for HEAD patch:
> > 1.
> > You must modify test_decoding/meson.build. It was also missing in patches for
> > backbranches.
> >
> > 2.
> > The test code needs cleanup. E.g.,
> > - Steps "s0_insert_cat" and "s0_savepoint" is not used
> > - Table user_cat is not used
> > It was also missing in patches for backbranches.
>
> Will fix the above points in the next version patch.
>
> >
> > 3.
> > Just to confirm - In SnapshotRestore(), the added attribute is not restored
> > from the disk. This is intentional because 1) it has been set to false when it
> > was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
> > by palloc0() so it was set to false. Is it right?
>
> Right.
>
> > Comments for backblanches patch:
> >
> > 1.
> > According to wikipage [2], new attribute must be at the end of struct.
>
> Good point. Will fix.
>

I've attached the new version patches for all branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
REL14_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 11.7 KB
master_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 12.3 KB
REL16_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 12.0 KB
REL15_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 11.7 KB
REL17_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 12.3 KB
REL12_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 11.7 KB
REL13_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch application/octet-stream 11.7 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2024-07-09 09:27:30 Re: Potential data loss due to race condition during logical replication slot creation
Previous Message Melanie Plageman 2024-07-08 21:23:52 Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae