From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
---|---|
To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date: | 2024-03-28 09:43:44 |
Message-ID: | ZgU70MjdOfO6l0O0@ip-10-97-1-34.eu-west-3.compute.internal |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Wed, Mar 27, 2024 at 09:00:37PM +0530, Bharath Rupireddy wrote:
> standby for sync slots. 0002 implementing inactive timeout GUC based
> invalidation mechanism.
>
> Please have a look.
Thanks!
Regarding 0002:
Some testing:
T1 ===
When the slot is invalidated on the primary, then the reason is propagated to
the sync slot (if any). That's fine but we are loosing the inactive_since on the
standby:
Primary:
postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
slot_name | inactive_since | conflicting | invalidation_reason
-------------+-------------------------------+-------------+---------------------
lsub29_slot | 2024-03-28 08:24:51.672528+00 | f | inactive_timeout
(1 row)
Standby:
postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
slot_name | inactive_since | conflicting | invalidation_reason
-------------+----------------+-------------+---------------------
lsub29_slot | | f | inactive_timeout
(1 row)
I think in this case it should always reflect the value from the primary (so
that one can understand why it is invalidated).
T2 ===
And it is set to a value during promotion:
postgres=# select pg_promote();
pg_promote
------------
t
(1 row)
postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
slot_name | inactive_since | conflicting | invalidation_reason
-------------+------------------------------+-------------+---------------------
lsub29_slot | 2024-03-28 08:30:11.74505+00 | f | inactive_timeout
(1 row)
I think when it is invalidated it should always reflect the value from the
primary (so that one can understand why it is invalidated).
T3 ===
As far the slot invalidation on the primary:
postgres=# SELECT * FROM pg_logical_slot_get_changes('lsub29_slot', NULL, NULL, 'include-xids', '0');
ERROR: cannot acquire invalidated replication slot "lsub29_slot"
Can we make the message more consistent with what can be found in CreateDecodingContext()
for example?
T4 ===
Also, it looks like querying pg_replication_slots() does not trigger an
invalidation: I think it should if the slot is not invalidated yet (and matches
the invalidation criteria).
Code review:
CR1 ===
+ Invalidate replication slots that are inactive for longer than this
+ amount of time. If this value is specified without units, it is taken
s/Invalidate/Invalidates/?
Should we mention the relationship with inactive_since?
CR2 ===
+ *
+ * If check_for_invalidation is true, the slot is checked for invalidation
+ * based on replication_slot_inactive_timeout GUC and an error is raised after making the slot ours.
*/
void
-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait,
+ bool check_for_invalidation)
s/check_for_invalidation/check_for_timeout_invalidation/?
CR3 ===
+ if (slot->inactive_since == 0 ||
+ replication_slot_inactive_timeout == 0)
+ return false;
Better to test replication_slot_inactive_timeout first? (I mean there is no
point of testing inactive_since if replication_slot_inactive_timeout == 0)
CR4 ===
+ if (slot->inactive_since > 0 &&
+ replication_slot_inactive_timeout > 0)
+ {
Same.
So, instead of CR3 === and CR4 ===, I wonder if it wouldn't be better to do
something like:
if (replication_slot_inactive_timeout == 0)
return false;
else if (slot->inactive_since > 0)
.
.
.
.
else
return false;
That would avoid checking replication_slot_inactive_timeout and inactive_since
multiple times.
CR5 ===
+ * held to avoid race conditions -- for example the restart_lsn could move
+ * forward, or the slot could be dropped.
Does the restart_lsn example makes sense here?
CR6 ===
+static bool
+InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks)
+{
InvalidatePossiblyInactiveSlot() maybe?
CR7 ===
+ /* Make sure the invalidated state persists across server restart */
+ slot->just_dirtied = true;
+ slot->dirty = true;
+ SpinLockRelease(&slot->mutex);
Maybe we could create a new function say MarkGivenReplicationSlotDirty()
with a slot as parameter, that ReplicationSlotMarkDirty could call too?
Then maybe we could set slot->data.invalidated = RS_INVAL_INACTIVE_TIMEOUT in
InvalidateSlotForInactiveTimeout()? (to avoid multiple SpinLockAcquire/SpinLockRelease).
CR8 ===
+ if (persist_state)
+ {
+ char path[MAXPGPATH];
+
+ sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+ SaveSlotToPath(slot, path, ERROR);
+ }
Maybe we could create a new function say GivenReplicationSlotSave()
with a slot as parameter, that ReplicationSlotSave() could call too?
CR9 ===
+ if (check_for_invalidation)
+ {
+ /* The slot is ours by now */
+ Assert(s->active_pid == MyProcPid);
+
+ /*
+ * Well, the slot is not yet ours really unless we check for the
+ * invalidation below.
+ */
+ s->active_pid = 0;
+ if (InvalidateReplicationSlotForInactiveTimeout(s, true, true))
+ {
+ /*
+ * If the slot has been invalidated, recalculate the resource
+ * limits.
+ */
+ ReplicationSlotsComputeRequiredXmin(false);
+ ReplicationSlotsComputeRequiredLSN();
+
+ /* Might need it for slot clean up on error, so restore it */
+ s->active_pid = MyProcPid;
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot acquire invalidated replication slot \"%s\"",
+ NameStr(MyReplicationSlot->data.name))));
+ }
+ s->active_pid = MyProcPid;
Are we not missing some SpinLockAcquire/Release on the slot's mutex here? (the
places where we set the active_pid).
CR10 ===
@@ -1628,6 +1674,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
if (SlotIsLogical(s))
invalidation_cause = cause;
break;
+ case RS_INVAL_INACTIVE_TIMEOUT:
+ if (InvalidateReplicationSlotForInactiveTimeout(s, false, false))
+ invalidation_cause = cause;
+ break;
InvalidatePossiblyObsoleteSlot() is not called with such a reason, better to use
an Assert here and in the caller too?
CR11 ===
+++ b/src/test/recovery/t/050_invalidate_slots.pl
why not using 019_replslot_limit.pl?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Pavel Borisov | 2024-03-28 09:46:30 | Re: Table AM Interface Enhancements |
Previous Message | Jelte Fennema-Nio | 2024-03-28 09:41:50 | Re: Possibility to disable `ALTER SYSTEM` |