Re: Restrict copying of invalidated replication slots

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Restrict copying of invalidated replication slots
Date: 2025-02-13 10:23:59
Message-ID: CALDaNm2rrxO5mg6OKoScw84K5P1Tw_cbjniHm+Geyxme8Ei-nQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> Hi,
>
> Currently, we can copy an invalidated slot using the function
> 'pg_copy_logical_replication_slot'. As per the suggestion in the
> thread [1], we should prohibit copying of such slots.
>
> I have created a patch to address the issue.

This patch does not fix all the copy_replication_slot scenarios
completely, there is a very corner concurrency case where an
invalidated slot still gets copied:
+ /* We should not copy invalidated replication slots */
+ if (src_isinvalidated)
+ ereport(ERROR,
+
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot copy an invalidated
replication slot")));

Consider the following scenario:
step 1) Set up streaming replication between the primary and standby nodes.
step 2) Create a logical replication slot (test1) on the standby node.
step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause
is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or
add a sleep in InvalidatePossiblyObsoleteSlot function like below:
if (cause == RS_INVAL_WAL_LEVEL)
{
while (bsleep)
sleep(1);
}
step 4) Reduce wal_level on the primary to replica and restart the primary node.
step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1',
'test2'); -- It will wait till the lock held by
InvalidatePossiblyObsoleteSlot is released while trying to create a
slot.
step 6) Increase wal_level back to logical on the primary node and
restart the primary.
step 7) Now allow the invalidation to happen (continue the breakpoint
held at step 3), the replication control lock will be released and the
invalidated slot will be copied

After this:
postgres=# SELECT 'copy' FROM
pg_copy_logical_replication_slot('test1', 'test2');
?column?
----------
copy
(1 row)

-- The invalidated slot (test1) is copied successfully:
postgres=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary
| active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phas
e | inactive_since | conflicting |
invalidation_reason | failover | synced
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+---------
--+----------------------------------+-------------+------------------------+----------+--------
test1 | test_decoding | logical | 5 | postgres | f
| f | | | 745 | 0/4029060 | 0/4029098
| lost | | f
| 2025-02-13 15:26:54.666725+05:30 | t |
wal_level_insufficient | f | f
test2 | test_decoding | logical | 5 | postgres | f
| f | | | 745 | 0/4029060 | 0/4029098
| reserved | | f
| 2025-02-13 15:30:30.477836+05:30 | f |
| f | f
(2 rows)

-- A subsequent attempt to decode changes from the invalidated slot
(test2) fails:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING: detected write past chunk end in TXN 0x5e77e6c6f300
ERROR: logical decoding on standby requires "wal_level" >= "logical"
on the primary

-- Alternatively, the following error may occur:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0
data
------------
BEGIN 744
COMMIT 744
(2 rows)

This is an edge case that can occur under specific conditions
involving replication slot invalidation when there is a huge lag
between primary and standby.
There might be a similar concurrency case for wal_removed too.

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2025-02-13 10:50:14 Re: pg_stat_statements and "IN" conditions
Previous Message Shlok Kyal 2025-02-13 10:20:38 Re: Restrict publishing of partitioned table with a foreign table as partition