Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-12-05 01:13:40
Message-ID: CAHut+Psi-mso-qSyMkHRP8e+psDFFmMp4NZTiL_yrFWidHkGMw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2024 at 9:27 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
...
>
> Currently, replication slots are invalidated based on the
> replication_slot_inactive_timeout only during a checkpoint. This means
> that if the checkpoint_timeout is set to a higher value than the
> replication_slot_inactive_timeout, slot invalidation will occur only
> when the checkpoint is triggered. Identifying the invalidation slots
> might be slightly delayed in this case. As an alternative, users can
> forcefully invalidate inactive slots that have exceeded the
> replication_slot_inactive_timeout by forcing a checkpoint. I was
> thinking we could suggest this in the documentation.
>
> + <para>
> + Slot invalidation due to inactive timeout occurs during checkpoint.
> + The duration of slot inactivity is calculated using the slot's
> + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
> + value.
> + </para>
> +
>
> We could accurately invalidate the slots using the checkpointer
> process by calculating the invalidation time based on the active_since
> timestamp and the replication_slot_inactive_timeout, and then set the
> checkpointer's main wait-latch accordingly for triggering the next
> checkpoint. Ideally, a different process handling this task would be
> better, but there is currently no dedicated daemon capable of
> identifying and managing slots across streaming replication, logical
> replication, and other slots used by plugins. Additionally,
> overloading the checkpointer with this responsibility may not be
> ideal. As an alternative, we could document about this delay in
> identifying and mention that it could be triggered by forceful manual
> checkpoint.
>

Hi Vignesh.

I felt that manipulating the checkpoint timing behind the scenes
without the user's consent might be a bit of an overreach.

But there might still be something else we could do:

1. We can add the documentation note as you suggested ("we could
document about this delay in identifying and mention that it could be
triggered by forceful manual checkpoint").

2. We can also detect such delays in the code. When the invalidation
occurs (e.g. code fragment below) we could check if there was some
excessive lag between the slot becoming idle and it being invalidated.
If the lag is too much (whatever "too much" means) we can log a hint
for the user to increase the checkpoint frequency (or whatever else we
might advise them to do).

+    /*
+     * Check if the slot needs to be invalidated due to
+     * replication_slot_inactive_timeout GUC.
+     */
+    if (IsSlotInactiveTimeoutPossible(s) &&
+        TimestampDifferenceExceeds(s->inactive_since, now,
+                                   replication_slot_inactive_timeout_ms))
+    {
+        invalidation_cause = cause;
+        inactive_since = s->inactive_since;

pseudo-code:
if (slot invalidation occurred long after the
    replication_slot_inactive_timeout GUC elapsed)
{
    elog(LOG, "This slot was inactive for a period of %s. Slot timeout
    invalidation only occurs at a checkpoint, so if you want inactive slots
    to be invalidated in a more timely manner, consider reducing the time
    between checkpoints or executing a manual checkpoint.
    (replication_slot_inactive_timeout = %s; checkpoint_timeout = %s,
    ....)");
}

+    }
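
To make the "excessive lag" idea a bit more concrete, here is a minimal
standalone sketch of the check, outside of any PostgreSQL code. Everything
in it is an assumption for illustration only: the ts_usec typedef stands in
for a microsecond timestamp, and the function names and the
tolerated_lag_ms threshold are hypothetical and not part of the patch. It
only shows one possible way to decide what "too much" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a microsecond-resolution timestamp. */
typedef int64_t ts_usec;

#define USECS_PER_MSEC INT64_C(1000)

/*
 * Return how many milliseconds the slot stayed idle beyond the configured
 * timeout, or 0 if the timeout has not been exceeded at all.
 */
int64_t
invalidation_lag_ms(ts_usec inactive_since, ts_usec now, int64_t timeout_ms)
{
    int64_t idle_ms = (now - inactive_since) / USECS_PER_MSEC;
    int64_t lag_ms = idle_ms - timeout_ms;

    return lag_ms > 0 ? lag_ms : 0;
}

/*
 * Decide whether the lag is large enough to be worth a hint in the log.
 * tolerated_lag_ms is an assumed threshold; how to choose it is the open
 * question ("whatever 'too much' means").
 */
bool
should_hint_about_checkpoints(ts_usec inactive_since, ts_usec now,
                              int64_t timeout_ms, int64_t tolerated_lag_ms)
{
    assert(timeout_ms > 0 && tolerated_lag_ms >= 0);

    return invalidation_lag_ms(inactive_since, now, timeout_ms)
        > tolerated_lag_ms;
}
```

For example, with a 60 s timeout and a slot that has been idle for 90 s by
the time the checkpoint runs, the lag is 30 s; with a tolerated lag of 10 s
that would produce the hint, whereas a slot invalidated 65 s after going
idle would not.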

======
Kind Regards,
Peter Smith.
Fujitsu Australia