Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-10 01:34:24
Message-ID: CAHut+PsHmfuDQt_gN2udjb9Ln00EvWw-eiRag5prB7JOgn9eEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, here is the remainder of my v45-0001 review. These comments are
for the test code only.

======
Testcase #1

1.
+# Testcase start
+#
+# Invalidate streaming standby slot and logical failover slot on primary due to
+# inactive timeout. Also, check logical failover slot synced to standby from
+# primary doesn't invalidate on its own, but gets the invalidated
state from the
+# primary.

nit - s/primary/the primary/ (in a couple of places)
nit - s/standby/the standby/
nit - other trivial tweaks.

~~~

2.
+# Create sync slot on primary
+$primary->psql('postgres',
+ q{SELECT pg_create_logical_replication_slot('sync_slot1',
'test_decoding', false, false, true);}
+);

nit - s/primary/the primary/

~~~

3.
+$primary->safe_psql(
+ 'postgres', qq[
+ SELECT pg_create_physical_replication_slot(slot_name :=
'sb_slot1', immediately_reserve := true);
+]);

Should this have a comment?

~~~

4.
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby1);

nit - s/standby/the standby/

~~~

5.
+# Sync primary slot to standby
+$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");

nit - /Sync primary slot to standby/Sync the primary slots to the standby/

~~~

6.
+# Confirm that logical failover slot is created on standby

nit - s/Confirm that logical failover slot is created on
standby/Confirm that the logical failover slot is created on the
standby/

~~~

7.
+is( $standby1->safe_psql(
+ 'postgres',
+ q{SELECT count(*) = 1 FROM pg_replication_slots
+ WHERE slot_name = 'sync_slot1' AND synced AND NOT temporary;}
+ ),
+ "t",
+ 'logical slot sync_slot1 has synced as true on standby');

IMO here you should also be checking that the sync slot state is NOT
invalidated, just as a counterpoint for the test part later that
checks that it IS invalidated.

~~~

8.
+my $inactive_timeout = 1;
+
+# Set timeout so that next checkpoint will invalidate inactive slot
+$primary->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET replication_slot_inactive_timeout TO
'${inactive_timeout}s';
+]);
+$primary->reload;

8a.
nit - I think that $inactive_timeout assignment belongs below your comment.

~

8b.
nit - s/Set timeout so that next checkpoint will invalidate inactive
slot/Set timeout GUC so that the next checkpoint will invalidate
inactive slots/

~~~

9.
+# Check for logical failover slot to become inactive on primary. Note that
+# nobody has acquired slot yet, so it must get invalidated due to
+# inactive timeout.

nit - /Check for logical failover slot to become inactive on
primary./Wait for logical failover slot to become inactive on the
primary./
nit - /has acquired slot/has acquired the slot/

~~~

10.
+# Sync primary slot to standby. Note that primary slot has already been
+# invalidated due to inactive timeout. Standby must just sync inavalidated
+# state.

nit - minor, add "the". fix typo "inavalidated", etc. suggestion:

Re-sync the primary slots to the standby. Note that the primary slot was already
invalidated (above) due to inactive timeout. The standby must just
sync the invalidated
state.

~~~

11.
+# Make standby slot on primary inactive and check for invalidation
+$standby1->stop;

nit - /standby slot/the standby slot/
nit - /on primary/on the primary/

======
Testcase #2

12.
I'm not sure it is necessary to do all this extra work. IIUC, there
was already almost everything you needed in the previous Testcase #1.
So, I thought you could just combine this extra standby timeout test
in Testcase #1.

Indeed, your Testcase #1 comment still says it is doing this: ("Also,
check logical failover slot synced to standby from primary doesn't
invalidate on its own,...")

e.g.
- NEW: set the GUC timeout on the standby
- sync the sync_slot (already doing in test #1)
- ensure the synced slot is NOT invalid (already suggested above for test #1)
- NEW: then do a standby sleep > timeout duration
- NEW: then do a standby CHECKPOINT...
- NEW: then ensure the sync slot invalidation did NOT happen
- then proceed with the rest of test #1...

======
Testcase #3

13.
nit - remove a few blank lines to group associated statements together.

~~~

14.
+$publisher->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET replication_slot_inactive_timeout TO '
${inactive_timeout}s';
+]);
+$publisher->reload;

nit - this deserves a comment, the same as in Testcase #1

======
sub wait_for_slot_invalidation

15.
+# Check for slot to first become inactive and then get invalidated
+sub check_for_slot_invalidation

nit - IMO the previous name was better (e.g. "wait_for.." instead of
"check_for...") because that describes exactly what the subroutine is
doing.

suggestion:
# Wait for the slot to first become inactive and then get invalidated
sub wait_for_slot_invalidation

~~~

16.
+{
+ my ($node, $slot, $offset, $inactive_timeout) = @_;
+ my $name = $node->name;

The variable $name seems too vague. How about $node_name?

~~~

17.
+ # Wait for invalidation reason to be set
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = '$slot' AND
+ invalidation_reason = 'inactive_timeout';
+ ])
+ or die
+ "Timed out while waiting for invalidation reason of slot $slot to
be set on node $name";

17a.
nit - /# Wait for invalidation reason to be set/# Check that the
invalidation reason is 'inactive_timeout'/

IIUC, the 'trigger_slot_invalidation' function has already invalidated
the slot at this point, so we are not really "Waiting..."; we are
"Checking..." that the reason was correctly set.

~

17b.
I think this code fragment maybe would be better put inside the
'trigger_slot_invalidation' function. (I've done this in the nitpicks
attachment)

~~~

18.
+ # Check that invalidated slot cannot be acquired
+ my ($result, $stdout, $stderr);
+
+ ($result, $stdout, $stderr) = $node->psql(
+ 'postgres', qq[
+ SELECT pg_replication_slot_advance('$slot', '0/1');
+ ]);

18a.
s/Check that invalidated slot/Check that an invalidated slot/

~

18b.
nit - Remove some blank lines, because the comment applies to all below it.

======
sub trigger_slot_invalidation

19.
+# Trigger slot invalidation and confirm it in server log
+sub trigger_slot_invalidation

nit - s/confirm it in server log/confirm it in the server log/

~

20.
+{
+ my ($node, $slot, $offset, $inactive_timeout) = @_;
+ my $name = $node->name;
+ my $invalidated = 0;

(same as the other subroutine)
nit - The variable $name seems too vague. How about $node_name?

======

Please refer to the attached nitpicks top-up patch which implements
most of the above nits.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
PS_NITPICKS_20240910_SLOT_V450001_TESTS.txt text/plain 7.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2024-09-10 02:27:00 Re: query_id, pg_stat_activity, extended query protocol
Previous Message Sami Imseih 2024-09-10 01:20:01 Re: query_id, pg_stat_activity, extended query protocol