Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-08 11:54:47
Message-ID: CALj2ACWXQT3_HY40ceqKf1DadjLQP6b1r=0sZRh-xhAOd-b0pA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for reviewing.

On Mon, Sep 2, 2024 at 1:37 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Commit message.
>
> 1.
> Because such synced slots are typically considered not
> active (for them to be later considered as inactive) as they don't
> perform logical decoding to produce the changes.
>
> This sentence is bad grammar. The docs have the same wording, so
> please see my doc review comment #4 suggestion below.

+1

> 2.
> + <para>
> + Invalidates replication slots that are inactive for longer than
> + specified amount of time. If this value is specified without units,
> + it is taken as seconds. A value of zero (which is default) disables
> + the timeout mechanism. This parameter can only be set in
> + the <filename>postgresql.conf</filename> file or on the server
> + command line.
> + </para>
> +
>
> nit - This is OK as-is, but OTOH why not make the wording consistent
> with the previous GUC description? (e.g. see my v43 [1] #2 review
> comment)

+1.

> 3.
> + <para>
> + This invalidation check happens either when the slot is acquired
> + for use or during checkpoint. The time since the slot has become
> + inactive is known from its
> + <structfield>inactive_since</structfield> value using which the
> + timeout is measured.
> + </para>
> +
>
> I felt this is slightly misleading because slot acquiring has nothing
> to do with setting the slot invalidation anymore. Furthermore, the 2nd
> sentence is bad grammar.
>
> nit - IMO something simple like the following rewording can address
> both of those points:
>
> Slot invalidation due to inactivity timeout occurs during checkpoint.
> The duration of slot inactivity is calculated using the slot's
> <structfield>inactive_since</structfield> field value.

+1.

> 4.
> + Because such synced slots are typically considered not active
> + (for them to be later considered as inactive) as they don't perform
> + logical decoding to produce the changes.
>
> That sentence has bad grammar.
>
> nit – suggest a much simpler replacement:
> Synced slots are always considered to be inactive because they don't
> perform logical decoding to produce changes.

+1.

> 5.
> +#define IsInactiveTimeoutSlotInvalidationApplicable(s) \
>
> 5a.
> I felt this would be better implemented as an inline function. Then it
> can be commented on properly to explain the parts of the condition.
> e.g. the large comment currently in InvalidatePossiblyObsoleteSlot()
> would be more appropriate in this function.

+1.

> 5b.
> The name is very long. Can't it be something shorter/simpler like:
> 'IsSlotATimeoutCandidate()'
>
> ~~~

Missing inactive in the above suggested name. Used
SlotInactiveTimeoutCheckAllowed, similar to XLogInsertAllowed.

> 6. ReplicationSlotAcquire
>
> -ReplicationSlotAcquire(const char *name, bool nowait)
> +ReplicationSlotAcquire(const char *name, bool nowait,
> + bool check_for_invalidation)
>
> nit - Previously this new parameter really did mean to "check" for
> [and set the slot] invalidation. But now I suggest renaming it to
> 'error_if_invalid' to properly reflect the new usage. And also in the
> slot.h.

+1.

> 7.
> + /*
> + * Error out if the slot has been invalidated previously. Because there's
> + * no use in acquiring the invalidated slot.
> + */
>
> nit - The comment is contrary to the code. If there was no reason to
> skip this error, then you would not have the new parameter allowing
> you to skip this error. I suggest just repeating the same comment as
> in the function header.

+1.

> 8. ReportSlotInvalidation
>
> nit - Added some blank lines for consistency.

+1.

> 9. InvalidatePossiblyObsoleteSlot
>
> 9a.
> Consistency is good (commit message, docs and code comments for this),
> but the added sentence has bad grammar. Please see the docs review
> comment #4 above for some alternate phrasing.

+1.

> 9b.
> Now that this logic is moved into a macro (I suggested it should be an
> inline function) IMO this comment does not belong here anymore because
> it is commenting code that you cannot see. Instead, this comment (or
> something like it) should be as comments within the new function.
>
> ======
> src/include/replication/slot.h

+1.

> 10.
> +extern void ReplicationSlotAcquire(const char *name, bool nowait,
> + bool check_for_invalidation);
>
> Change the new param name as described in the earlier review comment.

+1.

> Please refer to the attached file which implements some of the nits
> mentioned above.

Merged the diff into v45. Thanks.

On Tue, Sep 3, 2024 at 12:26 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> TEST CASE #1
>
> 1.
> +# Wait for the inactive replication slot to be invalidated.
>
> Is that comment correct? IIUC the synced slot should *already* be
> invalidated from the primary, so here we are not really "waiting" for
> it to be invalidated; Instead, we are just "confirming" that the
> synchronized slot is already invalidated with the correct reason as
> expected.

Modified the comment.

> 2.
> +# Synced slot mustn't get invalidated on the standby even after a checkpoint,
> +# it must sync invalidation from the primary. So, we must not see the slot's
> +# invalidation message in server log.
>
> This test case seemed bogus, for a couple of reasons:
>
> 2a. IIUC this 'lsub1_sync_slot' is the same one that is already
> invalid (from the primary), so nobody should be surprised that an
> already invalid slot doesn't get flagged as invalid again. i.e.
> Shouldn't your test scenario here be done using a valid synced slot?

+1. Added another test case for checking the synced slot not getting
invalidated despite inactive timeout being set.

> 2b. AFAICT it was only moments above this CHECKPOINT where you
> assigned the standby inactivity timeout to 2s. So even if there was
> some bug invalidating synced slots I don't think you gave it enough
> time to happen -- e.g. I doubt 2s has elapsed yet.

Added sleep(timeout+1) before the checkpoint.

> 3.
> +# Stop standby to make the standby's replication slot on the primary inactive
> +$standby1->stop;
> +
> +# Wait for the standby's replication slot to become inactive
>
> TEST CASE #2
>
> 4.
> +# Stop subscriber to make the replication slot on publisher inactive
> +$subscriber->stop;
> +
> +# Wait for the replication slot to become inactive and then invalidated due to
> +# timeout.
> +wait_for_slot_invalidation($publisher, 'lsub1_slot', $logstart,
> + $inactive_timeout);
>
> IIUC, this is just like comment #3 above. Both these (the stop and the
> wait) seem to belong together, so I think maybe a single bigger
> explanatory comment covering both parts would help for understanding.

Done.

> 5.
> +# Testcase end: Invalidate logical subscriber's slot due to
> +# replication_slot_inactive_timeout.
> +# =============================================================================
>
> IMO the rest of the comment after "Testcase end" isn't very useful.

Removed.

> ======
> sub wait_for_slot_invalidation
>
> 6.
> +sub wait_for_slot_invalidation
> +{
>
> An explanatory header comment for this subroutine would be helpful.

Done.

> 7.
> + # Wait for the replication slot to become inactive
> + $node->poll_query_until(
>
> Why are there are 2 separate poll_query_until's here? Can't those be
> combined into just one?

Ah. My bad. Removed.

> ~~~
>
> 8.
> + # Sleep at least $inactive_timeout duration to avoid multiple checkpoints
> + # for the slot to get invalidated.
> + sleep($inactive_timeout);
> +
>
> Maybe this special sleep to prevent too many CHECKPOINTs should be
> moved to be inside the other subroutine, which is actually doing those
> CHECKPOINTs.

Done.

> 9.
> + # Wait for the inactive replication slot to be invalidated
> + "Timed out while waiting for inactive slot $slot_name to be
> invalidated on node $name";
> +
>
> The comment seems misleading. IIUC you are not "waiting" for the
> invalidation here, because it is the other subroutine doing the
> waiting for the invalidation message in the logs. Instead, here I
> think you are just confirming the 'invalidation_reason' got set
> correctly. The comment should say what it is really doing.

Modified.

> sub check_for_slot_invalidation_in_server_log
>
> 10.
> +# Check for invalidation of slot in server log
> +sub check_for_slot_invalidation_in_server_log
> +{
>
> I think the main function of this subroutine is the CHECKPOINT and the
> waiting for the server log to say invalidation happened. It is doing a
> loop of a) CHECKPOINT then b) inspecting the server log for the slot
> invalidation, and c) waiting for a bit. Repeat 10 times.
>
> A comment describing the logic for this subroutine would be helpful.
>
> The most important side-effect of this function is the CHECKPOINT
> because without that nothing will ever get invalidated due to
> inactivity, but this key point is not obvious from the subroutine
> name.
>
> IMO it would be better to name this differently to reflect what it is
> really doing:
> e.g. "CHECKPOINT_and_wait_for_slot_invalidation_in_server_log"

That would be too long. Changed the function name to
trigger_slot_invalidation() which is appropriate.

Please find the v45 patch. Addressed above and Shveta's review comments [1].

Amit's comments [2] and [3] are still pending.

[1] https://www.postgresql.org/message-id/CAJpy0uC8Dg-0JS3NRUwVUemgz5Ar2v3_EQQFXyAigWSEQ8U47Q%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1K7DdT_5HnOWs5tVPYC%3D-h%2Bm85wu7k-7RVJaJ7zMxprWQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAA4eK1%2Bkt-QRr1RP%3DD%3D4_tp%2BS%2BCErQ6rNe7KVYEyZ3f6PYXpvw%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v45-0001-Add-inactive_timeout-based-replication-slot-inva.patch application/octet-stream 34.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2024-09-08 11:55:42 Re: Introduce XID age and inactive timeout based replication slot invalidation
Previous Message Oliver Ford 2024-09-08 11:43:49 Re: [PoC] Add CANONICAL option to xmlserialize