Re: Restrict copying of invalidated replication slots

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Restrict copying of invalidated replication slots
Date: 2025-02-17 11:31:16
Message-ID: CANhcyEVJpb6+hnk4MPVU3hZBYL=DS4v-PYBZOUoiivrN8Vd_Bw@mail.gmail.com
Lists: pgsql-hackers

On Thu, 13 Feb 2025 at 15:54, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > Currently, we can copy an invalidated slot using the function
> > 'pg_copy_logical_replication_slot'. As per the suggestion in the
> > thread [1], we should prohibit copying of such slots.
> >
> > I have created a patch to address the issue.
>
> This patch does not fix all the copy_replication_slot scenarios
> completely; there is a rare concurrency corner case where an
> invalidated slot still gets copied:
> +    /* We should not copy invalidated replication slots */
> +    if (src_isinvalidated)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("cannot copy an invalidated replication slot")));
>
> Consider the following scenario:
> step 1) Set up streaming replication between the primary and standby nodes.
> step 2) Create a logical replication slot (test1) on the standby node.
> step 3) Set a breakpoint in InvalidatePossiblyObsoleteSlot for the case
> where cause is RS_INVAL_WAL_LEVEL (other invalidation causes need not be
> held), or add a sleep in InvalidatePossiblyObsoleteSlot like below:
> if (cause == RS_INVAL_WAL_LEVEL)
> {
>     while (bsleep)
>         sleep(1);
> }
> step 4) Reduce wal_level on the primary to replica and restart the primary node.
> step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1',
> 'test2'); -- It will wait till the lock held by
> InvalidatePossiblyObsoleteSlot is released while trying to create a
> slot.
> step 6) Increase wal_level back to logical on the primary node and
> restart the primary.
> step 7) Now allow the invalidation to happen (continue from the breakpoint
> held at step 3). The replication control lock will be released and the
> invalidated slot will be copied.
>
> After this:
> postgres=# SELECT 'copy' FROM
> pg_copy_logical_replication_slot('test1', 'test2');
> ?column?
> ----------
> copy
> (1 row)
>
> -- The invalidated slot (test1) is copied successfully:
> postgres=# select * from pg_replication_slots ;
> slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase |          inactive_since          | conflicting |  invalidation_reason   | failover | synced
>-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------------------------+-------------+------------------------+----------+--------
> test1     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | lost       |               | f         | 2025-02-13 15:26:54.666725+05:30 | t           | wal_level_insufficient | f        | f
> test2     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | reserved   |               | f         | 2025-02-13 15:30:30.477836+05:30 | f           |                        | f        | f
> (2 rows)
>
> -- A subsequent attempt to decode changes from the copy of the
> invalidated slot (test2) fails:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING: detected write past chunk end in TXN 0x5e77e6c6f300
> ERROR: logical decoding on standby requires "wal_level" >= "logical" on the primary
>
> -- Alternatively, only the warning may occur and changes are still returned:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0
> data
> ------------
> BEGIN 744
> COMMIT 744
> (2 rows)
>
> This is an edge case that can occur under specific conditions
> involving replication slot invalidation when there is a huge lag
> between primary and standby.
> There might be a similar concurrency case for wal_removed too.
>

Hi Vignesh,

Thanks for reviewing the patch.

I have tested the above scenario and was able to reproduce it. I have
fixed it in the v2 patch.
Currently we are taking a shared lock on ReplicationSlotControlLock.
This issue can be resolved if we take an exclusive lock instead.
Thoughts?
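
To make the locking question concrete, below is a toy model in standalone C.
It is not PostgreSQL code and not the v2 patch. All names (ToySlot,
toy_copy_slot, toy_invalidate_slot, control_lock) are hypothetical, and a
pthread mutex merely stands in for holding ReplicationSlotControlLock in
exclusive mode. It only illustrates the general check-then-act pattern: the
invalidation check and the copy sit in one exclusive critical section, so in
this model an invalidation cannot land between them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a replication slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct ToySlot
{
    char          name[64];
    bool          in_use;
    bool          invalidated;   /* models invalidation_reason being set */
    unsigned long restart_lsn;
} ToySlot;

/* Assumption: a plain mutex models an exclusively held control lock. */
pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models an invalidation; it takes the same lock as the copy path. */
void
toy_invalidate_slot(ToySlot *slot)
{
    assert(slot != NULL);
    pthread_mutex_lock(&control_lock);
    slot->invalidated = true;
    pthread_mutex_unlock(&control_lock);
}

/*
 * Copy src into dst under the lock.
 * Returns 0 on success, -1 if src is unused or invalidated (dst untouched).
 */
int
toy_copy_slot(const ToySlot *src, ToySlot *dst, const char *dst_name)
{
    int rc = -1;

    assert(src != NULL && dst != NULL && dst_name != NULL);

    pthread_mutex_lock(&control_lock);
    if (src->in_use && !src->invalidated)
    {
        /* Check and copy happen in the same critical section. */
        *dst = *src;
        snprintf(dst->name, sizeof(dst->name), "%s", dst_name);
        rc = 0;
    }
    pthread_mutex_unlock(&control_lock);

    return rc;
}
```

In this model, copying a valid slot succeeds, and once toy_invalidate_slot()
has run on the source, toy_copy_slot() returns -1 and leaves the destination
untouched. Whether an exclusive lock is the right fix for
copy_replication_slot itself is the open question above.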

Thanks and Regards,
Shlok Kyal

Attachment Content-Type Size
v2-0001-Restrict-copying-of-invalidated-replication-slots.patch application/octet-stream 4.5 KB
