Re: Introduce XID age and inactive timeout based replication slot invalidation

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-11-14 03:44:40
Message-ID: CALDaNm2VQW_gpOJ-QWkEA_h18DN31ELEz2_7QmwWCAg9=Zew4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 13 Nov 2024 at 15:00, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> Please find the v48 patch attached.
>
> On Thu, Sep 19, 2024 at 9:40 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > When we promote hot standby with synced logical slots to become new
> > primary, the logical slots are never invalidated with
> > 'inactive_timeout' on new primary. It seems the check in
> > SlotInactiveTimeoutCheckAllowed() is wrong. We should allow
> > invalidation of slots on primary even if they are marked as 'synced'.
>
> fixed.
>
> > I have raised 4 issues so far on v46, the first 3 are in [1],[2],[3].
> > Once all these are addressed, I can continue reviewing further.
> >
>
> Fixed issues reported in [1], [2].

Few comments:
1) Since we don't change the value of now in
ReplicationSlotSetInactiveSince, the function parameter can be passed
by value:
+/*
+ * Set slot's inactive_since property unless it was previously invalidated.
+ */
+static inline void
+ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz *now,
+ bool
acquire_lock)
+{
+ if (s->data.invalidated != RS_INVAL_NONE)
+ return;
+
+ if (acquire_lock)
+ SpinLockAcquire(&s->mutex);
+
+ s->inactive_since = *now;

2) Currently it allows a minimum value of less than 1 second like in
milliseconds, I feel we can have some minimum value at least something
like checkpoint_timeout:
diff --git a/src/backend/utils/misc/guc_tables.c
b/src/backend/utils/misc/guc_tables.c
index 8a67f01200..367f510118 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3028,6 +3028,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},

+ {
+ {"replication_slot_inactive_timeout", PGC_SIGHUP,
REPLICATION_SENDING,
+ gettext_noop("Sets the amount of time a
replication slot can remain inactive before "
+ "it will be invalidated."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &replication_slot_inactive_timeout,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },

3) Since SlotInactiveTimeoutCheckAllowed check is just done above and
the current time has been retrieved can we used "now" variable instead
of SlotInactiveTimeoutCheckAllowed again second time:
@@ -1651,6 +1713,26 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
if (SlotIsLogical(s))
invalidation_cause = cause;
break;
+ case RS_INVAL_INACTIVE_TIMEOUT:
+
+ /*
+ * Check if the slot needs to
be invalidated due to
+ *
replication_slot_inactive_timeout GUC.
+ */
+ if
(SlotInactiveTimeoutCheckAllowed(s) &&
+
TimestampDifferenceExceeds(s->inactive_since, now,
+
replication_slot_inactive_timeout * 1000))
+ {
+ invalidation_cause = cause;
+ inactive_since =
s->inactive_since;

4) I'm not sure if this change required by this patch or is it a
general optimization, if it is required for this patch we can detail
the comments:
@@ -2208,6 +2328,7 @@ RestoreSlotFromDisk(const char *name)
bool restored = false;
int readBytes;
pg_crc32c checksum;
+ TimestampTz now;

/* no need to lock here, no concurrent access allowed yet */

@@ -2368,6 +2489,9 @@ RestoreSlotFromDisk(const char *name)
NameStr(cp.slotdata.name)),
errhint("Change \"wal_level\" to be
\"replica\" or higher.")));

+ /* Use same inactive_since time for all slots */
+ now = GetCurrentTimestamp();
+
/* nothing can be active yet, don't lock anything */
for (i = 0; i < max_replication_slots; i++)
{
@@ -2400,7 +2524,7 @@ RestoreSlotFromDisk(const char *name)
* slot from the disk into memory. Whoever acquires
the slot i.e.
* makes the slot active will reset it.
*/
- slot->inactive_since = GetCurrentTimestamp();
+ slot->inactive_since = now;

5) Why should the slot invalidation be updated during shutdown,
shouldn't the inactive_since value be intact during shutdown?
- <literal>NULL</literal> if the slot is currently being used.
- Note that for slots on the standby that are being synced from a
+ <literal>NULL</literal> if the slot is currently being used. Once the
+ slot is invalidated, this value will remain unchanged until we shutdown
+ the server. Note that for slots on the standby that are being
synced from a

6) New Style of ereport does not need braces around errcode, it can be
changed similarly:
+ if (error_if_invalid &&
+ s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)
+ {
+ Assert(s->inactive_since > 0);
+ ereport(ERROR,
+
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer get changes
from replication slot \"%s\"",
+ NameStr(s->data.name)),
+ errdetail("This slot has been
invalidated because it was inactive for longer than the amount of time
specified by \"%s\".",
+
"replication_slot_inactive_timeout")));

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-11-14 04:22:57 Re: Skip collecting decoded changes of already-aborted transactions
Previous Message Hayato Kuroda (Fujitsu) 2024-11-14 02:56:45 RE: Fix for pageinspect bug in PG 17