Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-12-11 02:44:05
Message-ID: CAHut+Pvx294U-XBB6-BvabesUNxbnuDQmk-VOFm=pbcNWSsHvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Nisha.

Here are some review comments for patch v54-0002.

(I had also checked patch v54-0001, but have no further review
comments for that one).

======
doc/src/sgml/config.sgml

1.
+ <para>
+ Slot invalidation due to idle timeout occurs during checkpoint.
+ If the <varname>checkpoint_timeout</varname> exceeds
+ <varname>idle_replication_slot_timeout</varname>, the slot
+ invalidation will be delayed until the next checkpoint is triggered.
+ To avoid delays, users can force a checkpoint to promptly invalidate
+ inactive slots. The duration of slot inactivity is calculated
using the slot's
+ <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
+ value.
+ </para>
+

The wording of "If the checkpoint_timeout exceeds
idle_replication_slot_timeout, the slot invalidation will be delayed
until the next checkpoint is triggered." seems slightly misleading,
because AFAIK it is not conditional on the GUC value differences like
that -- i.e. slot invalidation is *always* delayed until the next
checkpoint occurs.

SUGGESTION:
Slot invalidation due to idle timeout occurs during checkpoint.
Because checkpoints happen at checkpoint_timeout intervals, there can
be some lag between when the idle_replication_slot_timeout was
exceeded and when the slot invalidation is triggered at the next
checkpoint. To avoid such lags, users can force...

=======
src/backend/replication/slot.c

2. GENERAL

+/* Invalidate replication slots idle beyond this time; '0' disables it */
+int idle_replication_slot_timeout_ms = 0;

I noticed this patch is using a variety of ways of describing the same thing:
* guc var: Invalidate replication slots idle beyond this time...
* guc_tables: ... the amount of time a replication slot can remain
idle before it will be invalidated.
* docs: means that the slot has remained idle beyond the duration
specified by the idle_replication_slot_timeout parameter
* errmsg: ... slot has been invalidated because inactivity exceeded
the time limit set by ...
* etc..

They are all the same, but they are all worded slightly differently:
* "idle" vs "inactivity" vs ...
* "time" vs "amount of time" vs "duration" vs "time limit" vs ...

There may not be a one-size-fits-all, but still, it might be better to
try to search for all different phrasing and use common wording as
much as possible.

~~~

CheckPointReplicationSlots:

3.
+ * XXX: Slot invalidation due to 'idle_timeout' occurs only for
+ * released slots, based on 'idle_replication_slot_timeout'. Active
+ * slots in use for replication are excluded, preventing accidental
+ * invalidation. Slots where communication between the publisher and
+ * subscriber is down are also excluded, as they are managed by the
+ * 'wal_sender_timeout'.

Maybe a slight rewording like below is better. Maybe not. YMMV.

SUGGESTION:
XXX: Slot invalidation due to 'idle_timeout' applies only to released
slots, and is based on the 'idle_replication_slot_timeout' GUC. Active
slots
currently in use for replication are excluded to prevent accidental
invalidation. Slots...

======
src/bin/pg_upgrade/server.c

4.
+ /*
+ * Use idle_replication_slot_timeout=0 to prevent slot invalidation due to
+ * inactive_timeout by checkpointer process during upgrade.
+ */
+ if (GET_MAJOR_VERSION(cluster->major_version) >= 1800)
+ appendPQExpBufferStr(&pgoptions, " -c idle_replication_slot_timeout=0");
+

/inactive_timeout/idle_timeout/

======
src/test/recovery/t/043_invalidate_inactive_slots.pl

5.
+# Wait for slot to first become idle and then get invalidated
+sub wait_for_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;

AFAICT this 'idle_timeout' parameter is passed units of "seconds", so
it would be better to call it something like 'idle_timeout_s' to make
the units clear.

~~~

6.
+# Trigger slot invalidation and confirm it in the server log
+sub trigger_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;
+ my $invalidated = 0;

Ditto above review comment #5 -- better to call it something like
'idle_timeout_s' to make the units clear.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2024-12-11 02:51:27 Re: Skip collecting decoded changes of already-aborted transactions
Previous Message Hayato Kuroda (Fujitsu) 2024-12-11 02:34:29 RE: Parallel heap vacuum