Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-08-30 02:43:10
Message-ID: CAHut+PuFzCHPCiZbpoQX59kgZbebuWT0gR0O7rOe4t_sdYu=OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, here are some review comments for patch v43-0001.

======
Commit message

1.
... introduces a GUC allowing users set inactive timeout.

~

1a. You should give the name of the new GUC in the commit message.

1b. /set/to set/

======
doc/src/sgml/config.sgml

GUC "replication_slot_inactive_timeout"

2.
Invalidates replication slots that are inactive for longer than
specified amount of time

nit - suggest use similar wording as the prior GUC (wal_sender_timeout):
Invalidate replication slots that are inactive for longer than this
amount of time.

~

3.
This invalidation check happens either when the slot is acquired for
use or during a checkpoint. The time since the slot has become
inactive is known from its inactive_since value using which the
timeout is measured.

nit - the wording is too complicated. suggest:
The timeout check occurs when the slot is next acquired for use, or
during a checkpoint. The slot's 'inactive_since' field value is when
the slot became inactive.

~

4.
Note that the inactive timeout invalidation mechanism is not
applicable for slots on the standby that are being synced from a
primary server (whose synced field is true).

nit - that word "whose" seems ambiguous. suggest:
(e.g. the standby slot has 'synced' field true).

======
doc/src/sgml/system-views.sgml

5.
inactive_timeout means that the slot has been inactive for the
duration specified by replication_slot_inactive_timeout parameter.

nit - suggestion ("longer than"):
... the slot has been inactive for longer than the duration specified
by the replication_slot_inactive_timeout parameter.

======
src/backend/replication/slot.c

6.
/* Maximum number of invalidation causes */
-#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL
+#define RS_INVAL_MAX_CAUSES RS_INVAL_INACTIVE_TIMEOUT

IMO this #define belongs in the slot.h, immediately below where the
enum is defined.

~~~

7. ReplicationSlotAcquire:

I had a fundamental question about this logic.

IIUC the purpose of the patch was to invalidate replication slots that
have been inactive for too long.

So, it makes sense to me that some periodic processing (e.g.
CheckPointReplicationSlots) might do a sweep over all the slots, and
invalidate the too-long-inactive ones that it finds.

OTOH, it seemed quite strange to me that the patch logic is also
detecting and invalidating inactive slots during the
ReplicationSlotAcquire function. This is kind of saying "ERROR -
sorry, because this was inactive for too long you can't have it" at
the very moment that you wanted to use it again! IIUC such a slot
would be invalidated by the function InvalidatePossiblyObsoleteSlot(),
but therein lies my doubt -- how can the slot be considered as
"obsolete" when we are in the very act of trying to acquire/use it?

I guess it might be argued this is not so different to the scenario of
attempting to acquire a slot that had been invalidated momentarily
before during checkpoint processing. But, somehow that scenario seems
more like bad luck to me, versus ReplicationSlotAcquire() deliberately
invalidating something we *know* is wanted.

~

8.
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer get changes from replication slot \"%s\"",
+ NameStr(s->data.name)),
+ errdetail("This slot has been invalidated because it was inactive
since %s for more than %d seconds specified by
\"replication_slot_inactive_timeout\".",
+ timestamptz_to_str(s->inactive_since),
+ replication_slot_inactive_timeout)));

nit - IMO the info should be split into errdetail + errhint. Like this:
errdetail("The slot became invalid because it was inactive since %s,
which is more than %d seconds ago."...)
errhint("You might need to increase \"%s\".",
"replication_slot_inactive_timeout")

~~~

9. ReportSlotInvalidation

+ appendStringInfo(&err_detail,
+ _("The slot has been inactive since %s for more than %d seconds
specified by \"replication_slot_inactive_timeout\"."),
+ timestamptz_to_str(inactive_since),
+ replication_slot_inactive_timeout);
+ break;

IMO this error in ReportSlotInvalidation() should be the same as the
other one from ReplicationSlotAcquire(), which I suggested above
(comment #8) should include a hint. Also, including a hint here will
make this new message consistent with the other errhint (for
"max_slot_wal_keep_size") that is already in this function.

~~~

10. InvalidatePossiblyObsoleteSlot

+ if (cause == RS_INVAL_INACTIVE_TIMEOUT &&
+ (replication_slot_inactive_timeout > 0 &&
+ s->inactive_since > 0 &&
+ !(RecoveryInProgress() && s->data.synced)))

10a. Everything here is && so this has some redundant parentheses.

10b. Actually, IMO this complicated condition is overkill. Won't it be
better to just unconditionally assign
now = GetCurrentTimestamp(); here?

~

11.
+ * Note that we don't invalidate synced slots because,
+ * they are typically considered not active as they don't
+ * perform logical decoding to produce the changes.

nit - tweaked punctuation

~

12.
+ * If the slot can be acquired, do so or if the slot is already ours,
+ * then mark it invalidated. Otherwise we'll signal the owning
+ * process, below, and retry.

nit - tidied this comment. Suggestion:
If the slot can be acquired, do so and mark it as invalidated. If the
slot is already ours, mark it as invalidated. Otherwise, we'll signal
the owning process below and retry.

~

13.
+ if (active_pid == 0 ||
+ (MyReplicationSlot != NULL &&
+ MyReplicationSlot == s &&
+ active_pid == MyProcPid))

You are already checking MyReplicationSlot == s here, so that extra
check for MyReplicationSlot != NULL is redundant, isn't it?

~~~

14. CheckPointReplicationSlots

/*
- * Flush all replication slots to disk.
+ * Flush all replication slots to disk. Also, invalidate slots during
+ * non-shutdown checkpoint.
*
* It is convenient to flush dirty replication slots at the time of checkpoint.
* Additionally, in case of a shutdown checkpoint, we also identify the slots

nit - /Also, invalidate slots/Also, invalidate obsolete slots/

======
src/backend/utils/misc/guc_tables.c

15.
+ {"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets the amount of time to wait before invalidating an "
+ "inactive replication slot."),

nit - that is maybe a bit misleading because IIUC there is no real
"waiting" happening anywhere. Suggest:
Sets the amount of time a replication slot can remain inactive before
it will be invalidated.

======

Please take a look at the attached top-up patches. These include
changes for many of the nits above.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
PS_NITPICKS_20240830_CODE_V430001.txt text/plain 3.0 KB
PS_NITPICKS_20240829_DOCS_v430001.txt text/plain 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Xing Guo 2024-08-30 03:55:17 JIT: Remove some unnecessary instructions.
Previous Message Andy Fan 2024-08-30 01:34:34 Re: Make printtup a bit faster