Re: Introduce XID age and inactive timeout based replication slot invalidation

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-09 05:23:50
Message-ID: CAJpy0uAczcmiuFcZM_K+k2vK_Hcmrww73fUtHRrW3nvT9WABLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Sep 8, 2024 at 5:25 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
>
> Please find the v45 patch. Addressed above and Shveta's review comments [1].
>

Thanks for the patch. Please find my comments:

1)
src/sgml/config.sgml:

+ Synced slots are always considered to be inactive because they
don't perform logical decoding to produce changes.

It is better we avoid such a statement, as internally we use logical
decoding to advance restart-lsn, see
'LogicalSlotAdvanceAndCheckSnapState' called form slotsync.c.
<Also see related comment 6 below>

2)
src/sgml/config.sgml:

+ disables the inactive timeout invalidation mechanism

+ Slot invalidation due to inactivity timeout occurs during checkpoint.

Either have 'inactive' at both the places or 'inactivity'.

3)
slot.c:
+static bool InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause
cause,
+ ReplicationSlot *s,
+ XLogRecPtr oldestLSN,
+ Oid dboid,
+ TransactionId snapshotConflictHorizon,
+ bool *invalidated);
+static inline bool SlotInactiveTimeoutCheckAllowed(ReplicationSlot *s);

I think, we do not need above 2 declarations. The code compile fine
without these as the usage is later than the definition.

4)
+ /*
+ * An error is raised if error_if_invalid is true and the slot has been
+ * invalidated previously.
+ */
+ if (error_if_invalid && s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)

The comment is generic while the 'if condition' is specific to one
invalidation cause. Even though I feel it can be made generic test for
all invalidation causes but that is not under scope of this thread and
needs more testing/analysis. For the time being, we can make comment
specific to the concerned invalidation cause. The header of function
will also need the same change.

5)
SlotInactiveTimeoutCheckAllowed():

+ * Check if inactive timeout invalidation mechanism is disabled or slot is
+ * currently being used or server is in recovery mode or slot on standby is
+ * currently being synced from the primary.
+ *

These comments say exact opposite of what we are checking in code.
Since the function name has 'Allowed' in it, we should be putting
comments which say what allows it instead of what disallows it.

6)

+ * Synced slots are always considered to be inactive because they don't
+ * perform logical decoding to produce changes.
+ */
+static inline bool
+SlotInactiveTimeoutCheckAllowed(ReplicationSlot *s)

Perhaps we should avoid mentioning logical decoding here. When slots
are synced, they are performing decoding and their inactive_since is
changing continuously. A better way to make this statement will be:

We want to ensure that the slots being synchronized are not
invalidated, as they need to be preserved for future use when the
standby server is promoted to the primary. This is necessary for
resuming logical replication from the new primary server.
<Rephrase if needed>

7)

InvalidatePossiblyObsoleteSlot()

we are calling SlotInactiveTimeoutCheckAllowed() twice in this
function. We shall optimize.

At the first usage place, shall we simply get timestamp when cause is
RS_INVAL_INACTIVE_TIMEOUT without checking
SlotInactiveTimeoutCheckAllowed() as IMO it does not seem a
performance critical section. Or if we retain check at first place,
then at the second place we can avoid calling it again based on
whether 'now' is NULL or not.

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nitin Motiani 2024-09-09 05:25:32 Re: race condition in pg_class
Previous Message Shlok Kyal 2024-09-09 05:21:43 Re: long-standing data loss bug in initial sync of logical replication