From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Restrict copying of invalidated replication slots |
Date: | 2025-03-10 06:16:09 |
Message-ID: | CANhcyEU1A2o4LtAt=R0D3zB=kZgEgfWRGQpR9BvO4dr20=N5Fg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, 28 Feb 2025 at 08:56, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with
> > > > > this operation. So, isn't it possible that the source slot exists at
> > > > > the later position in ReplicationSlotCtl->replication_slots and the
> > > > > loop traversing slots is already ahead from the point where the newly
> > > > > copied slot is created?
> > > >
> > > > Good point. I think it's possible.
> > > >
> > > > > If so, the newly created slot won't be marked
> > > > > as invalid whereas the source slot will be marked as invalid. I agree
> > > > > that even in such a case, at a later point, the newly created slot
> > > > > will also be marked as invalid.
> > > >
> > > > The wal_status of the newly created slot would immediately become
> > > > 'lost' and the next checkpoint will invalidate it. Do we need to do
> > > > something to deal with this case?
> > > >
> > >
> > > + /* Check if source slot became invalidated during the copy operation */
> > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE)
> > > + ereport(ERROR,
> > > + errmsg("cannot copy replication slot \"%s\"",
> > > + NameStr(*src_name)),
> > > + errdetail("The source replication slot was invalidated during the
> > > copy operation."));
> > >
> > > How about adding a similar test as above for MyReplicationSlot? That
> > > should suffice the need because we have already acquired the new slot
> > > by this time and invalidation should signal this process before
> > > marking the new slot as invalid.
> >
> > IIUC in the scenario you mentioned, the loop traversing slots already
> > passed the position of newly created slot in
> > ReplicationSlotCtl->replication_slots array, so
> > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no?
> >
>
> Right, I don't have a simpler solution for this apart from making it
> somehow serialize this operation with
> InvalidateObsoleteReplicationSlots(). I don't think we need to go that
> far to handle the scenario being discussed. It is better to add a
> comment for this race and why it won't harm.
I tried to reproduce the scenario described using the following steps:
Here are the steps:
1. set max_replication_slots = 2
2. create two logical replication slot, lets say slot1 and slot2. drop
the first slot (slot1).
3. Now run pg_copy_logical_replication slot function on slot2. Lets
say copied slot is 'slot_cp'.
In function copy_replication_slot add breakpoint just after function
'create_logical_replication_slot'.
slot_cp will be created. In array
ReplicationSlotCtl->replication_slots, slot_cp will come before slot2.
4. Now invalidate the 'slot2'.
Function 'InvalidateObsoleteReplicationSlots' will be called. Now it
will loop through ReplicationSlotCtl->replication_slots and will get
'slot_cp' first. Since the slot is acquired by copy_replication_slot,
It will wait for the slot to be released. Once slot is released,
'slot_cp' and 'slot2' will be invalidated.
I have tried it with 'wal_removal' invalidation. So it is issue in
PG_13 to HEAD.
I also see a kill signal sent to function 'copy_replication_slot'.
Terminal output:
postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp');
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in
the list with wal_status as lost.
I have added the comment on existing patches for REL_16 to master.
Created a new patch to add only comments in REL13-REL15.
Thanks and Regards,
Shlok Kyal
Attachment | Content-Type | Size |
---|---|---|
REL_17_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch | application/octet-stream | 6.0 KB |
REL_16_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch | application/octet-stream | 5.8 KB |
REL_15-REL_13_v7-0001-Comment-race-condition-in-copy_replication_slot.patch | application/octet-stream | 2.0 KB |
master_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch | application/octet-stream | 5.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2025-03-10 06:21:04 | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash |
Previous Message | Amit Kapila | 2025-03-10 06:12:02 | Re: Parallel heap vacuum |