Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-03 06:55:58
Message-ID: CAHut+PtaLA6v9bzmH=g51HhZNNJ5ixvMiokosdnrSFpU51arEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, my previous review posts did not cover the test code.

Here are my review comments for the v44-0001 test code

======
TEST CASE #1

1.
+# Wait for the inactive replication slot to be invalidated.
+$standby1->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = 'lsub1_sync_slot' AND
+ invalidation_reason = 'inactive_timeout';
+])
+ or die
+ "Timed out while waiting for lsub1_sync_slot invalidation to be
synced on standby";
+

Is that comment correct? IIUC the synced slot should *already* be
invalidated from the primary, so here we are not really "waiting" for
it to be invalidated; Instead, we are just "confirming" that the
synchronized slot is already invalidated with the correct reason as
expected.

~~~

2.
+# Synced slot mustn't get invalidated on the standby even after a checkpoint,
+# it must sync invalidation from the primary. So, we must not see the slot's
+# invalidation message in server log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+ok( !$standby1->log_contains(
+ "invalidating obsolete replication slot \"lsub1_sync_slot\"",
+ $standby1_logstart),
+ 'check that syned lsub1_sync_slot has not been invalidated on the standby'
+);
+

This test case seemed bogus, for a couple of reasons:

2a. IIUC this 'lsub1_sync_slot' is the same one that is already
invalid (from the primary), so nobody should be surprised that an
already invalid slot doesn't get flagged as invalid again. i.e.
Shouldn't your test scenario here be done using a valid synced slot?

2b. AFAICT it was only moments above this CHECKPOINT where you
assigned the standby inactivity timeout to 2s. So even if there was
some bug invalidating synced slots I don't think you gave it enough
time to happen -- e.g. I doubt 2s has elapsed yet.

~

3.
+# Stop standby to make the standby's replication slot on the primary inactive
+$standby1->stop;
+
+# Wait for the standby's replication slot to become inactive
+wait_for_slot_invalidation($primary, 'sb1_slot', $logstart,
+ $inactive_timeout);

This seems a bit tricky. Both these (the stop and the wait) seem to
belong together, so I think maybe a single bigger explanatory comment
covering both parts would help for understanding.

======
TEST CASE #2

4.
+# Stop subscriber to make the replication slot on publisher inactive
+$subscriber->stop;
+
+# Wait for the replication slot to become inactive and then invalidated due to
+# timeout.
+wait_for_slot_invalidation($publisher, 'lsub1_slot', $logstart,
+ $inactive_timeout);

IIUC, this is just like comment #3 above. Both these (the stop and the
wait) seem to belong together, so I think maybe a single bigger
explanatory comment covering both parts would help for understanding.

~~~

5.
+# Testcase end: Invalidate logical subscriber's slot due to
+# replication_slot_inactive_timeout.
+# =============================================================================

IMO the rest of the comment after "Testcase end" isn't very useful.

======
sub wait_for_slot_invalidation

6.
+sub wait_for_slot_invalidation
+{

An explanatory header comment for this subroutine would be helpful.

~~~

7.
+ # Wait for the replication slot to become inactive
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = '$slot_name' AND active = 'f';
+ ])
+ or die
+ "Timed out while waiting for slot $slot_name to become inactive on
node $name";
+
+ # Wait for the replication slot info to be updated
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE inactive_since IS NOT NULL
+ AND slot_name = '$slot_name' AND active = 'f';
+ ])
+ or die
+ "Timed out while waiting for info of slot $slot_name to be updated
on node $name";
+

Why are there are 2 separate poll_query_until's here? Can't those be
combined into just one?

~~~

8.
+ # Sleep at least $inactive_timeout duration to avoid multiple checkpoints
+ # for the slot to get invalidated.
+ sleep($inactive_timeout);
+

Maybe this special sleep to prevent too many CHECKPOINTs should be
moved to be inside the other subroutine, which is actually doing those
CHECKPOINTs.

~~~

9.
+ # Wait for the inactive replication slot to be invalidated
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = '$slot_name' AND
+ invalidation_reason = 'inactive_timeout';
+ ])
+ or die
+ "Timed out while waiting for inactive slot $slot_name to be
invalidated on node $name";
+

The comment seems misleading. IIUC you are not "waiting" for the
invalidation here, because it is the other subroutine doing the
waiting for the invalidation message in the logs. Instead, here I
think you are just confirming the 'invalidation_reason' got set
correctly. The comment should say what it is really doing.

======
sub check_for_slot_invalidation_in_server_log

10.
+# Check for invalidation of slot in server log
+sub check_for_slot_invalidation_in_server_log
+{

I think the main function of this subroutine is the CHECKPOINT and the
waiting for the server log to say invalidation happened. It is doing a
loop of a) CHECKPOINT then b) inspecting the server log for the slot
invalidation, and c) waiting for a bit. Repeat 10 times.

A comment describing the logic for this subroutine would be helpful.

The most important side-effect of this function is the CHECKPOINT
because without that nothing will ever get invalidated due to
inactivity, but this key point is not obvious from the subroutine
name.

IMO it would be better to name this differently to reflect what it is
really doing:
e.g. "CHECKPOINT_and_wait_for_slot_invalidation_in_server_log"

======
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2024-09-03 07:07:58 Re: per backend I/O statistics
Previous Message Kyotaro Horiguchi 2024-09-03 06:37:49 Re: per backend I/O statistics