Re: Restrict copying of invalidated replication slots

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Restrict copying of invalidated replication slots
Date: 2025-03-17 17:27:08
Message-ID: CAD21AoBtjj_xoH+pgHP_CKykj00fayjFqQ60W6az-Drj2+mOdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 9, 2025 at 11:16 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> On Fri, 28 Feb 2025 at 08:56, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with
> > > > > > this operation. So, isn't it possible that the source slot exists at
> > > > > > the later position in ReplicationSlotCtl->replication_slots and the
> > > > > > loop traversing slots is already ahead from the point where the newly
> > > > > > copied slot is created?
> > > > >
> > > > > Good point. I think it's possible.
> > > > >
> > > > > > If so, the newly created slot won't be marked
> > > > > > as invalid whereas the source slot will be marked as invalid. I agree
> > > > > > that even in such a case, at a later point, the newly created slot
> > > > > > will also be marked as invalid.
> > > > >
> > > > > The wal_status of the newly created slot would immediately become
> > > > > 'lost' and the next checkpoint will invalidate it. Do we need to do
> > > > > something to deal with this case?
> > > > >
> > > >
> > > > + /* Check if source slot became invalidated during the copy operation */
> > > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE)
> > > > + ereport(ERROR,
> > > > + errmsg("cannot copy replication slot \"%s\"",
> > > > + NameStr(*src_name)),
> > > > + errdetail("The source replication slot was invalidated during the
> > > > copy operation."));
> > > >
> > > > How about adding a similar test as above for MyReplicationSlot? That
> > > > should suffice the need because we have already acquired the new slot
> > > > by this time and invalidation should signal this process before
> > > > marking the new slot as invalid.
> > >
> > > IIUC in the scenario you mentioned, the loop traversing slots already
> > > passed the position of newly created slot in
> > > ReplicationSlotCtl->replication_slots array, so
> > > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no?
> > >
> >
> > Right, I don't have a simpler solution for this apart from making it
> > somehow serialize this operation with
> > InvalidateObsoleteReplicationSlots(). I don't think we need to go that
> > far to handle the scenario being discussed. It is better to add a
> > comment for this race and why it won't harm.
>
> I tried to reproduce the scenario described using the following steps:
>
> Here are the steps:
>
> 1. set max_replication_slots = 2
>
> 2. create two logical replication slot, lets say slot1 and slot2. drop
> the first slot (slot1).
>
> 3. Now run pg_copy_logical_replication slot function on slot2. Lets
> say copied slot is 'slot_cp'.
> In function copy_replication_slot add breakpoint just after function
> 'create_logical_replication_slot'.
> slot_cp will be created. In array
> ReplicationSlotCtl->replication_slots, slot_cp will come before slot2.
>
> 4. Now invalidate the 'slot2'.
> Function 'InvalidateObsoleteReplicationSlots' will be called. Now it
> will loop through ReplicationSlotCtl->replication_slots and will get
> 'slot_cp' first. Since the slot is acquired by copy_replication_slot,
> It will wait for the slot to be released. Once slot is released,
> 'slot_cp' and 'slot2' will be invalidated.
>
> I have tried it with 'wal_removal' invalidation. So it is issue in
> PG_13 to HEAD.
> I also see a kill signal sent to function 'copy_replication_slot'.
>
> Terminal output:
> postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp');
> FATAL: terminating connection due to administrator command
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> 5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in
> the list with wal_status as lost.
>
> I have added the comment on existing patches for REL_16 to master.
> Created a new patch to add only comments in REL13-REL15.

I think we don't need a patch adding only comments to v15 or older. We
can either write the fact in the commit message that v15 or older are
not affected fortunately or add an explicit check if the copied slot
doesn't have invalid restart_lsn.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2025-03-17 17:31:44 Re: Update Unicode data to Unicode 16.0.0
Previous Message Jacob Champion 2025-03-17 17:22:47 Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible