Re: Introduce XID age and inactive timeout based replication slot invalidation

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-11-28 14:10:32
Message-ID: CALDaNm1F2YrswzM_WM37BYmiZ9Cf60UD_mgtm8HnMHRGA7tx4g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 27 Nov 2024 at 16:25, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> On Wed, Nov 27, 2024 at 8:39 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > Hi Nisha,
> >
> > Here are some review comments for the patch v50-0002.
> >
> > ======
> > src/backend/replication/slot.c
> >
> > InvalidatePossiblyObsoleteSlot:
> >
> > 1.
> > + if (now &&
> > + TimestampDifferenceExceeds(s->inactive_since, now,
> > + replication_slot_inactive_timeout_sec * 1000))
> >
> > Previously this was using an additional call to SlotInactiveTimeoutCheckAllowed:
> >
> > + if (SlotInactiveTimeoutCheckAllowed(s) &&
> > + TimestampDifferenceExceeds(s->inactive_since, now,
> > + replication_slot_inactive_timeout * 1000))
> >
> > Is it OK to skip that call? e.g. can the slot fields possibly change
> > between assigning the 'now' and acquiring the mutex? If not, then the
> > current code is fine. The only reason for asking is because it is
> > slightly suspicious that it was not done this "easy" way in the first
> > place.
> >
> Good catch! While the mutex was being acquired right after the now
> assignment, there was a rare chance of another process modifying the
> slot in the meantime. So, I reverted the change in v51. To optimize
> the SlotInactiveTimeoutCheckAllowed() call, it's sufficient to check
> it here instead of during the 'now' assignment.
>
> Attached v51 patch-set addressing all comments in [1] and [2].

Few comments:
1) replication_slot_inactive_timeout can be mentioned in logical
replication config, we could mention something like:
Logical replication slot is also affected by replication_slot_inactive_timeout

2.a) Is this change applicable only for inactive timeout or it is
applicable to others like wal removed, wal level etc also? If it is
applicable to all of them we could move this to the first patch and
update the commit message:
+ * If the slot can be acquired, do so and mark it as
invalidated. If
+ * the slot is already ours, mark it as invalidated.
Otherwise, we'll
+ * signal the owning process below and retry.
*/
- if (active_pid == 0)
+ if (active_pid == 0 ||
+ (MyReplicationSlot == s &&
+ active_pid == MyProcPid))

2.b) Also this MyReplicationSlot and active_pid check can be in same line:
+ (MyReplicationSlot == s &&
+ active_pid == MyProcPid))

3) Error detail should start in upper case here similar to how others are done:
+ case RS_INVAL_INACTIVE_TIMEOUT:
+ appendStringInfo(&err_detail,
_("inactivity exceeded the time limit set by \"%s\"."),
+
"replication_slot_inactive_timeout");
+ break;

4) Since this change is not related to this patch, we can move this to
the first patch and update the commit message:
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data,
size_t startup_data_len)
static void
update_synced_slots_inactive_since(void)
{
- TimestampTz now = 0;
+ TimestampTz now;

/*
* We need to update inactive_since only when we are promoting
standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
/* The slot sync worker or SQL function mustn't be running by now */
Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);

+ /* Use same inactive_since time for all slots */
+ now = GetCurrentTimestamp();

5) Since this change is not related to this patch, we can move this to
the first patch.
@@ -2250,6 +2350,7 @@ RestoreSlotFromDisk(const char *name)
bool restored = false;
int readBytes;
pg_crc32c checksum;
+ TimestampTz now;

/* no need to lock here, no concurrent access allowed yet */

@@ -2410,6 +2511,9 @@ RestoreSlotFromDisk(const char *name)
NameStr(cp.slotdata.name)),
errhint("Change \"wal_level\" to be
\"replica\" or higher.")));

+ /* Use same inactive_since time for all slots */
+ now = GetCurrentTimestamp();
+
/* nothing can be active yet, don't lock anything */
for (i = 0; i < max_replication_slots; i++)
{
@@ -2440,9 +2544,11 @@ RestoreSlotFromDisk(const char *name)
/*
* Set the time since the slot has become inactive
after loading the
* slot from the disk into memory. Whoever acquires
the slot i.e.
- * makes the slot active will reset it.
+ * makes the slot active will reset it. Avoid calling
+ * ReplicationSlotSetInactiveSince() here, as it will
not set the time
+ * for invalid slots.
*/
- slot->inactive_since = GetCurrentTimestamp();
+ slot->inactive_since = now;

[1] - https://www.postgresql.org/docs/current/logical-replication-config.html

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-11-28 14:14:12 Re: Fix comment in reorderbuffer.h
Previous Message Dean Rasheed 2024-11-28 13:56:10 Re: Added prosupport function for estimating numeric generate_series rows