Re: Restrict copying of invalidated replication slots

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Restrict copying of invalidated replication slots
Date: 2025-03-18 09:28:29
Message-ID: CANhcyEUXbG+f3xKJyg6MhJXY9CbMDRYBwwF+dQOQaJq81zeA=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 17 Mar 2025 at 22:57, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sun, Mar 9, 2025 at 11:16 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >
> > On Fri, 28 Feb 2025 at 08:56, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with
> > > > > > > this operation. So, isn't it possible that the source slot exists at
> > > > > > > the later position in ReplicationSlotCtl->replication_slots and the
> > > > > > > loop traversing slots is already ahead from the point where the newly
> > > > > > > copied slot is created?
> > > > > >
> > > > > > Good point. I think it's possible.
> > > > > >
> > > > > > > If so, the newly created slot won't be marked
> > > > > > > as invalid whereas the source slot will be marked as invalid. I agree
> > > > > > > that even in such a case, at a later point, the newly created slot
> > > > > > > will also be marked as invalid.
> > > > > >
> > > > > > The wal_status of the newly created slot would immediately become
> > > > > > 'lost' and the next checkpoint will invalidate it. Do we need to do
> > > > > > something to deal with this case?
> > > > > >
> > > > >
> > > > > + /* Check if source slot became invalidated during the copy operation */
> > > > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE)
> > > > > + ereport(ERROR,
> > > > > + errmsg("cannot copy replication slot \"%s\"",
> > > > > + NameStr(*src_name)),
> > > > > + errdetail("The source replication slot was invalidated during the
> > > > > copy operation."));
> > > > >
> > > > > How about adding a similar test as above for MyReplicationSlot? That
> > > > > should suffice the need because we have already acquired the new slot
> > > > > by this time and invalidation should signal this process before
> > > > > marking the new slot as invalid.
> > > >
> > > > IIUC in the scenario you mentioned, the loop traversing slots already
> > > > passed the position of newly created slot in
> > > > ReplicationSlotCtl->replication_slots array, so
> > > > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no?
> > > >
> > >
> > > Right, I don't have a simpler solution for this apart from making it
> > > somehow serialize this operation with
> > > InvalidateObsoleteReplicationSlots(). I don't think we need to go that
> > > far to handle the scenario being discussed. It is better to add a
> > > comment for this race and why it won't harm.
> >
> > I tried to reproduce the scenario described using the following steps:
> >
> > Here are the steps:
> >
> > 1. set max_replication_slots = 2
> >
> > 2. create two logical replication slot, lets say slot1 and slot2. drop
> > the first slot (slot1).
> >
> > 3. Now run pg_copy_logical_replication slot function on slot2. Lets
> > say copied slot is 'slot_cp'.
> > In function copy_replication_slot add breakpoint just after function
> > 'create_logical_replication_slot'.
> > slot_cp will be created. In array
> > ReplicationSlotCtl->replication_slots, slot_cp will come before slot2.
> >
> > 4. Now invalidate the 'slot2'.
> > Function 'InvalidateObsoleteReplicationSlots' will be called. Now it
> > will loop through ReplicationSlotCtl->replication_slots and will get
> > 'slot_cp' first. Since the slot is acquired by copy_replication_slot,
> > It will wait for the slot to be released. Once slot is released,
> > 'slot_cp' and 'slot2' will be invalidated.
> >
> > I have tried it with 'wal_removal' invalidation. So it is issue in
> > PG_13 to HEAD.
> > I also see a kill signal sent to function 'copy_replication_slot'.
> >
> > Terminal output:
> > postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp');
> > FATAL: terminating connection due to administrator command
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Succeeded.
> >
> > 5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in
> > the list with wal_status as lost.
> >
> > I have added the comment on existing patches for REL_16 to master.
> > Created a new patch to add only comments in REL13-REL15.
>
> I think we don't need a patch adding only comments to v15 or older. We
> can either write the fact in the commit message that v15 or older are
> not affected fortunately or add an explicit check if the copied slot
> doesn't have invalid restart_lsn.
>
I felt adding a message in the commit message would be sufficient. I
have made the changes for the same in [1].

[1]: https://www.postgresql.org/message-id/CANhcyEV%2BwwhquX3J9J5gKVdhirGFhNxDAXJArz3udp94vNsvPA%40mail.gmail.com

Thanks and Regards,
Shlok Kyal

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Verite 2025-03-18 09:36:28 Re: Add Pipelining support in psql
Previous Message Jelte Fennema-Nio 2025-03-18 09:27:38 Re: Add Pipelining support in psql