Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-11-19 07:17:37
Message-ID: CABdArM4w-rBmSMqPvsWLdj+OwLZzOZ-MX_596qhHmc+XvP9dVg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 14, 2024 at 5:29 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Hi Nisha.
>
> Thanks for the recent patch updates. Here are my review comments for
> the latest patch v48-0001.
>

Thank you for the review. Comments are addressed in v49 version.
Below is my response to comments that may require further discussion.

> ======
> doc/src/sgml/system-views.sgml
>
> 2.
> <para>
> The time since the slot has become inactive.
> - <literal>NULL</literal> if the slot is currently being used.
> - Note that for slots on the standby that are being synced from a
> + <literal>NULL</literal> if the slot is currently being used. Once the
> + slot is invalidated, this value will remain unchanged until we shutdown
> + the server. Note that for slots on the standby that are being
> synced from a
> primary server (whose <structfield>synced</structfield> field is
> <literal>true</literal>), the
>
> Is this change related to the new inactivity timeout feature or are
> you just clarifying the existing behaviour of the 'active_since'
> field.
>

Yes, this patch introduces inactive_timeout invalidation and prevents
updates to inactive_since for invalid slots. Only a node restart can
modify it, so, I believe we should retain these lines in this patch.

> Note there is already another thread [1] created to patch/clarify this
> same field. So if you are just clarifying existing behavior then IMO
> it would be better if you can to try and get your desired changes
> included there quickly before that other patch gets pushed.
>

Thanks for the reference, I have posted my suggestion on the thread.

>
> ReplicationSlotAcquire:
>
> 5.
> + *
> + * An error is raised if error_if_invalid is true and the slot has been
> + * invalidated previously.
> */
> void
> -ReplicationSlotAcquire(const char *name, bool nowait)
> +ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)
>
> This function comment makes it seem like "invalidated previously"
> might mean *any* kind of invalidation, but later in the body of the
> function we find the logic is really only used for inactive timeout.
>
> + /*
> + * An error is raised if error_if_invalid is true and the slot has been
> + * previously invalidated due to inactive timeout.
> + */
>
> So, I think a better name for that parameter might be
> 'error_if_inactive_timeout'
>
> OTOH, if it really is supposed to erro for *any* kind of invalidation
> then there needs to be more ereports.
>

+1 to the idea.
I have created a separate patch v49-0001 adding more ereports for all
kinds of invalidations.

> ~~~
> SlotInactiveTimeoutCheckAllowed:
>
> 8.
> +/*
> + * Is this replication slot allowed for inactive timeout invalidation check?
> + *
> + * Inactive timeout invalidation is allowed only when:
> + *
> + * 1. Inactive timeout is set
> + * 2. Slot is inactive
> + * 3. Server is in recovery and slot is not being synced from the primary
> + *
> + * Note that the inactive timeout invalidation mechanism is not
> + * applicable for slots on the standby server that are being synced
> + * from the primary server (i.e., standby slots having 'synced' field 'true').
> + * Synced slots are always considered to be inactive because they don't
> + * perform logical decoding to produce changes.
> + */
>
> 8a.
> Somehow that first sentence seems strange. Would it be better to write it like:
>
> SUGGESTION
> Can this replication slot timeout due to inactivity?
>

I feel the suggestion is not very clear on the purpose of the
function, This function doesn't check inactivity or decide slot
timeout invalidation. It only pre-checks if the slot qualifies for an
inactivity check, which the caller will perform.
As I have changed function name too as per commnet#9, I used the following -
"Is inactive timeout invalidation possible for this replication slot?"
Thoughts?

> ~
> 8c.
> Similarly, I think something about that "Note that the inactive
> timeout invalidation mechanism is not applicable..." paragraph needs
> tweaking because IMO that should also now be saying something about
> 'RecoveryInProgress'.
>

'RecoveryInProgress' check indicates that the server is a standby, and
the mentioned paragraph uses the term "standby" to describe the
condition. It seems unnecessary to mention RecoveryInProgress
separately.

> ~~~
>
> InvalidatePossiblyObsoleteSlot:
>
> 10.
> break;
> + case RS_INVAL_INACTIVE_TIMEOUT:
> +
> + /*
> + * Check if the slot needs to be invalidated due to
> + * replication_slot_inactive_timeout GUC.
> + */
>
> Since there are no other blank lines anywhere in this switch, the
> introduction of this one in v48 looks out of place to me.

pgindent automatically added this blank line after 'case
RS_INVAL_INACTIVE_TIMEOUT'.

> IMO it would
> be more readable if a blank line followed each/every of the breaks,
> but then that is not a necessary change for this patch so...
>

Since it's not directly related to the patch, I feel it might be best
to leave it as is for now.

> ~~~
>
> 11.
> + /*
> + * Invalidation due to inactive timeout implies that
> + * no one is using the slot.
> + */
> + Assert(s->active_pid == 0);
>
> Given this assertion, does it mean that "(s->active_pid == 0)" should
> have been another condition done up-front in the function
> 'SlotInactiveTimeoutCheckAllowed'?
>

I don't think it's a good idea to check (s->active_pid == 0) upfront,
before the timeout-invalidation check. AFAIU, this assertion is meant
to ensure active_pid = 0 only if the slot is going to be invalidated,
i.e., when the following condition is true:

TimestampDifferenceExceeds(s->inactive_since, now,

replication_slot_inactive_timeout_sec * 1000)

Thoughts? Open to others' opinions too.

> ~~~
>
> 12.
> /*
> - * If the slot can be acquired, do so and mark it invalidated
> - * immediately. Otherwise we'll signal the owning process, below, and
> - * retry.
> + * If the slot can be acquired, do so and mark it as invalidated. If
> + * the slot is already ours, mark it as invalidated. Otherwise, we'll
> + * signal the owning process below and retry.
> */
> - if (active_pid == 0)
> + if (active_pid == 0 ||
> + (MyReplicationSlot == s &&
> + active_pid == MyProcPid))
>
> I wasn't sure how this change belongs to this patch, because the logic
> of the previous review comment said for the case of invalidation due
> to inactivity that active_id must be 0. e.g. Assert(s->active_pid ==
> 0);
>

I don't fully understand the purpose of this change yet. I'll look
into it further and get back.

> ~~~
>
> RestoreSlotFromDisk:
>
> 13.
> - slot->inactive_since = GetCurrentTimestamp();
> + slot->inactive_since = now;
>
> In v47 this assignment used to call the function
> 'ReplicationSlotSetInactiveSince'. I recognise there is a very subtle
> difference between direct assignment and the function, because the
> function will skip assignment if the slot is already invalidated.
> Anyway, if you are *deliberately* not wanting to call
> ReplicationSlotSetInactiveSince here then I think this assignment
> should be commented to explain the reason why not, otherwise someone
> in the future might be tempted to think it was just an oversight and
> add the call back in that you don't want.
>

Added comment saying avoid using ReplicationSlotSetInactiveSince()
here as it will skip the invalid slots.

~~~~

--
Thanks,
Nisha

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nisha Moond 2024-11-19 07:20:51 Re: Introduce XID age and inactive timeout based replication slot invalidation
Previous Message Shlok Kyal 2024-11-19 07:14:37 Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY