From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Restrict copying of invalidated replication slots |
Date: | 2025-02-13 10:23:59 |
Message-ID: | CALDaNm2rrxO5mg6OKoScw84K5P1Tw_cbjniHm+Geyxme8Ei-nQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> Hi,
>
> Currently, we can copy an invalidated slot using the function
> 'pg_copy_logical_replication_slot'. As per the suggestion in the
> thread [1], we should prohibit copying of such slots.
>
> I have created a patch to address the issue.
This patch does not cover every copy_replication_slot scenario. There is
a concurrency corner case in which an invalidated slot still gets copied,
even with the following check in place:
+	/* We should not copy invalidated replication slots */
+	if (src_isinvalidated)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot copy an invalidated replication slot")));
Consider the following scenario:
step 1) Set up streaming replication between the primary and standby nodes.
step 2) Create a logical replication slot (test1) on the standby node.
step 3) Set a breakpoint in InvalidatePossiblyObsoleteSlot for the case
where cause is RS_INVAL_WAL_LEVEL (there is no need to hold the other
invalidation causes), or add a sleep to InvalidatePossiblyObsoleteSlot
like the one below:
    if (cause == RS_INVAL_WAL_LEVEL)
    {
        while (bsleep)
            sleep(1);
    }
step 4) Reduce wal_level on the primary to replica and restart the primary node.
step 5) Run SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
While trying to create the new slot, it will wait until the lock held by
InvalidatePossiblyObsoleteSlot is released.
step 6) Increase wal_level back to logical on the primary node and
restart the primary.
step 7) Now let the invalidation proceed (continue from the breakpoint
held at step 3). The replication control lock will be released and the
invalidated slot will be copied.
After this:
postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
?column?
----------
copy
(1 row)
-- The invalidated slot (test1) is copied successfully:
postgres=# select * from pg_replication_slots ;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase |          inactive_since          | conflicting |  invalidation_reason   | failover | synced
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------------------------+-------------+------------------------+----------+--------
 test1     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | lost       |               | f         | 2025-02-13 15:26:54.666725+05:30 | t           | wal_level_insufficient | f        | f
 test2     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | reserved   |               | f         | 2025-02-13 15:30:30.477836+05:30 | f           |                        | f        | f
(2 rows)
-- A subsequent attempt to decode changes from the copy of the
invalidated slot (test2) fails:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING: detected write past chunk end in TXN 0x5e77e6c6f300
ERROR: logical decoding on standby requires "wal_level" >= "logical" on the primary
-- Alternatively, the call may return rows with only a warning:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0
data
------------
BEGIN 744
COMMIT 744
(2 rows)
This is an edge case that can occur under specific conditions
involving replication slot invalidation when there is a huge lag
between primary and standby.
There might be a similar concurrency case for wal_removed too.
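To make the interleaving in steps 5 to 7 easier to see, here is a rough
toy model of the window. It is only a sketch, not PostgreSQL code: ToySlot,
toy_copy_slot, wait_hook and invalidate_while_waiting are all hypothetical
names, and the hook merely stands in for the time the copy spends blocked
on the lock. The recheck flag shows one possible direction (looking at the
source again after the wait); it is not something the posted patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a replication slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct ToySlot
{
	char		name[64];
	bool		in_use;
	bool		invalidated;
} ToySlot;

/* Hypothetical hook run at the point where the copy would block on the lock. */
typedef void (*wait_hook) (ToySlot *src);

/* Models the invalidation finishing while the copy is blocked (step 7). */
static void
invalidate_while_waiting(ToySlot *src)
{
	src->invalidated = true;
}

/*
 * Returns true if dst was created.  With recheck == false the source is
 * validated only before the wait, which mirrors the window described above.
 * With recheck == true the source is validated again after the wait.
 */
static bool
toy_copy_slot(ToySlot *src, ToySlot *dst, const char *dst_name,
			  wait_hook hook, bool recheck)
{
	assert(src != NULL && dst != NULL && dst_name != NULL);

	/* Check made before blocking: the source still looks valid here. */
	if (src->invalidated)
		return false;

	/* Stand-in for waiting on the lock; the hook may change src. */
	if (hook != NULL)
		hook(src);

	/* Optional second look at the source once the wait is over. */
	if (recheck && src->invalidated)
		return false;

	memset(dst, 0, sizeof(*dst));
	strncpy(dst->name, dst_name, sizeof(dst->name) - 1);
	dst->in_use = true;
	dst->invalidated = false;	/* the copy starts out looking healthy */
	return true;
}
```

Calling toy_copy_slot(&src, &dst, "test2", invalidate_while_waiting, false)
returns true and leaves src invalidated with dst looking healthy, which is
the test1/test2 state shown above. With recheck set to true the same call
returns false and dst is left untouched.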
Regards,
Vignesh