Re: Potential data loss due to race condition during logical replication slot creation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Callahan, Drew" <callaan(at)amazon(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Potential data loss due to race condition during logical replication slot creation
Date: 2024-07-09 09:27:30
Message-ID: CAA4eK1KrVfUWw9WnCF191iDDhoiMQ+jmOPExCRLTbq4dPoK9Zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > >
> > > Dear Sawada-san,
> > >
> > > Thanks for creating the patch!
> > >
> > > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > > the patch for other backbranches).
> > > >
> > > > In the patches, I used a different approach in between HEAD and
> > > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > > should skip snapshot restores into the SnapBuild struct and set it
> > > > only while finding the start point. Therefore we have to bump
> > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > > above; store the flag in LogicalDecodingContext and set it when
> > > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > > downside of the approach taken for backbranches is that we implicitly
> > > > require for users to use the same LogicalDecodingContext for both
> > > > initializing the context for a new slot and finding its start point.
> > > > IIUC it was not strictly required. This restriction would not be a
> > > > problem at least in the core, but I'm not sure if there are no
> > > > external extensions that create a logical slot in that way. This is
> > > > the main reason why I used a different approach on HEAD and
> > > > backbranches. Therefore, if it does not matter, I think we can use the
> > > > same approach on all branches, which is better in terms of
> > > > maintainability.
> > >
> > > I want to confirm your point. You meant that you wanted to unify appraoches,
> > > especially you wanted to store the flag in LogicalDecodingContext, rigth?
> >
> > Yes. Ideally I'd like to use the same approach in all branches
> > regardless of how to fix it for better maintainability.
> >

The difference is minor so using slightly different approaches should
be okay. In the ideal case, I agree that using the same approach makes
sense but for future extendability, it is better to keep this new
variable in SnapBuild at least in HEAD and PG17.

>
> I've attached the new version patches for all branches.
>

Few comments:
1.
@@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
{
ReplicationSlot *slot = ctx->slot;

+ /* Let snapshot builder start to find the start point */
+ SnapBuildSetFindStartPoint(ctx->snapshot_builder, true);
+
/* Initialize from where to start reading WAL. */
XLogBeginRead(ctx->reader, slot->data.restart_lsn);

@@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
if (slot->data.two_phase)
slot->data.two_phase_at = ctx->reader->EndRecPtr;
SpinLockRelease(&slot->mutex);
+
+ /* Complete to find the start point */
+ SnapBuildSetFindStartPoint(ctx->snapshot_builder, false);

I wonder why you didn't choose to set this variable in
AllocateSnapshotBuilder()? If we do so, then we may not need set/reset
in DecodingContextFindStartpoint(). The one advantage of using
set/reset for the minimal time as done here is that we can avoid any
impact of this new variable but I still feel setting it in
AllocateSnapshotBuilder() seems more natural.

2.
Since in this case
+ * the restart LSN could be in the middle of transactions we need to
+ * find the start point where we won't see the data for partial
+ * transactions.

There is a connecting word missing between *transactions* and *we*.
Can we use a slightly different wording like: "Can't use this method
while finding the start point for decoding changes as the restart LSN
would be an arbitrary LSN but we need to find the start point to
extract changes where we won't see the data for partial
transactions."?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2024-07-09 11:00:29 BUG #18531: err when using 'current of' with incremental COMMIT
Previous Message Masahiko Sawada 2024-07-09 05:29:52 Re: Potential data loss due to race condition during logical replication slot creation