Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-17 01:27:24
Message-ID: CAHut+Ps=x+2Hq5ue0YppOeDZqgHTnyw=u+vs-qy0JRjKaeJtew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are a few comments for the patch v46-0001.

======
src/backend/replication/slot.c

1. ReportSlotInvalidation

On Mon, Sep 16, 2024 at 8:01 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Mon, Sep 9, 2024 at 1:11 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > 3. ReportSlotInvalidation
> >
> > I didn't understand why there was a hint for:
> > "You might need to increase \"%s\".", "max_slot_wal_keep_size"
> >
> > Why aren't these similar cases consistent?
>
> It looks misleading and not very useful. What happens if the removed
> WAL (that's needed for the slot) is put back into pg_wal somehow (by
> manually copying from archive or by some tool/script)? Can the slot
> invalidated due to wal_removed start sending WAL to its clients?
>
> > But you don't have an equivalent hint for timeout invalidation:
> > "You might need to increase \"%s\".", "replication_slot_inactive_timeout"
>
> I removed this per review comments upthread.

IIUC the errors are quite similar, so my previous review comment was
mostly about the unexpected inconsistency of why one of them has a
hint and the other one does not. I don't have a strong opinion about
whether they should both *have* or *not have* hints, so long as they
are treated the same.

If you think the current code hint is not useful then maybe we need a
new thread to address that existing issue. For example, maybe it
should be removed or reworded.

~~~

2. InvalidatePossiblyObsoleteSlot:

+ case RS_INVAL_INACTIVE_TIMEOUT:
+
+ if (!SlotInactiveTimeoutCheckAllowed(s))
+ break;
+
+ /*
+ * Check if the slot needs to be invalidated due to
+ * replication_slot_inactive_timeout GUC.
+ */
+ if (TimestampDifferenceExceeds(s->inactive_since, now,
+ replication_slot_inactive_timeout * 1000))

nit - it might be tidier to avoid multiple breaks by just combining
these conditions. See the nitpick attachment.

~~~

3.
* - RS_INVAL_WAL_LEVEL: is logical
+ * - RS_INVAL_INACTIVE_TIMEOUT: inactive timeout occurs

nit - use comment wording "inactive slot timeout has occurred", to
make it identical to the comment in slot.h

======
src/test/recovery/t/050_invalidate_slots.pl

4.
+# Despite inactive timeout being set, the synced slot won't get invalidated on
+# its own on the standby. So, we must not see invalidation message in server
+# log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+ok( !$standby1->log_contains(
+ "invalidating obsolete replication slot \"sync_slot1\"",
+ $logstart),
+ 'check that synced slot sync_slot1 has not been invalidated on standby'
+);
+

It seems kind of brittle to check the logs for something that is NOT
there because any change to the message will make this accidentally
pass. Apart from that, it might anyway be more efficient just to check
the pg_replication_slots again to make sure the 'invalidation_reason
remains' still NULL.

======

Please see the attachment which implements some of the nit changes
mentioned above.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
PS_NITPICKS_20240917_SLOT_TIMEOUT_v46.txt text/plain 1.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2024-09-17 02:40:04 Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN
Previous Message Thomas Munro 2024-09-17 01:01:17 Re: Robocopy might be not robust enough for never-ending testing on Windows