Re: Introduce XID age and inactive timeout based replication slot invalidation

From:	vignesh C <vignesh21(at)gmail(dot)com>
To:	Peter Smith <smithpb2250(at)gmail(dot)com>
Cc:	Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Introduce XID age and inactive timeout based replication slot invalidation
Date:	2024-12-06 05:34:29
Message-ID:	CALDaNm3znxKvTv=MDmMBOSk6XurkKgYyh9cH49NLL7SmJ62Q_A@mail.gmail.com
Lists:	pgsql-hackers

On Thu, 5 Dec 2024 at 06:44, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Wed, Dec 4, 2024 at 9:27 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> ...
> >
> > Currently, replication slots are invalidated based on the
> > replication_slot_inactive_timeout only during a checkpoint. This means
> > that if the checkpoint_timeout is set to a higher value than the
> > replication_slot_inactive_timeout, slot invalidation will occur only
> > when the checkpoint is triggered. Identifying the invalidation slots
> > might be slightly delayed in this case. As an alternative, users can
> > forcefully invalidate inactive slots that have exceeded the
> > replication_slot_inactive_timeout by forcing a checkpoint. I was
> > thinking we could suggest this in the documentation.
> >
> > + <para>
> > + Slot invalidation due to inactive timeout occurs during checkpoint.
> > + The duration of slot inactivity is calculated using the slot's
> > + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
> > + value.
> > + </para>
> > +
> >
> > We could accurately invalidate the slots using the checkpointer
> > process by calculating the invalidation time based on the inactive_since
> > timestamp and the replication_slot_inactive_timeout, and then set the
> > checkpointer's main wait-latch accordingly for triggering the next
> > checkpoint. Ideally, a different process handling this task would be
> > better, but there is currently no dedicated daemon capable of
> > identifying and managing slots across streaming replication, logical
> > replication, and other slots used by plugins. Additionally,
> > overloading the checkpointer with this responsibility may not be
> > ideal. As an alternative, we could document about this delay in
> > identifying and mention that it could be triggered by forceful manual
> > checkpoint.
> >
>
> Hi Vignesh.
>
> I felt that manipulating the checkpoint timing behind the scenes
> without the user's consent might be a bit of an overreach.

Agreed.

> But there might still be something else we could do:
>
> 1. We can add the documentation note like you suggested ("we could
> document about this delay in identifying and mention that it could be
> triggered by forceful manual checkpoint").

Yes, that makes sense.

> 2. We can also detect such delays in the code. When the invalidation
> occurs (e.g. code fragment below) we could check if there was some
> excessive lag between the slot becoming idle and it being invalidated.
> If the lag is too much (whatever "too much" means) we can log a hint
> for the user to increase the checkpoint frequency (or whatever else we
> might advise them to do).
>
> + /*
> +  * Check if the slot needs to be invalidated due to
> +  * replication_slot_inactive_timeout GUC.
> +  */
> + if (IsSlotInactiveTimeoutPossible(s) &&
> +     TimestampDifferenceExceeds(s->inactive_since, now,
> +                                replication_slot_inactive_timeout_ms))
> + {
> +     invalidation_cause = cause;
> +     inactive_since = s->inactive_since;
>
> pseudo-code:
> if (slot invalidation occurred much later after the
> replication_slot_inactive_timeout GUC elapsed)
> {
> elog(LOG, "This slot was inactive for a period of %s. Slot timeout
> invalidation only occurs at a checkpoint so if you want inactive slots
> to be invalidated in a more timely manner consider reducing the time
> between checkpoints or executing a manual checkpoint.
> (replication_slot_inactive_timeout = %s; checkpoint_timeout = %s,
> ....)"
> }
>
> + }

Determining the correct time may be challenging for users, as it
depends on when the slot's inactive_since value is set, when
checkpoint_timeout elapses, and when the subsequent checkpoint is
actually triggered. Even if the user sets it to an appropriate value,
identification can still be delayed, depending on when the slot's
inactive_since is set relative to the checkpoint schedule. Including
this information in the documentation should be sufficient.
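
To illustrate the timing dependence, here is a minimal sketch (not part
of the patch). The helper names are hypothetical, all times are in
seconds, and it assumes checkpoints fire at exact multiples of
checkpoint_timeout; in practice checkpoints can also be triggered by WAL
volume or run manually, so this only approximates the time-driven case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a PostgreSQL function.
 *
 * Returns the time of the first checkpoint at which a slot that became
 * inactive at 'inactive_since' would be found to have exceeded
 * 'inactive_timeout'.  Assumes checkpoints happen at exact multiples of
 * 'checkpoint_interval' (an illustrative simplification).
 */
static int64_t
first_detecting_checkpoint(int64_t inactive_since,
                           int64_t inactive_timeout,
                           int64_t checkpoint_interval)
{
    int64_t deadline;

    assert(checkpoint_interval > 0);
    assert(inactive_since >= 0 && inactive_timeout >= 0);

    /* Earliest moment at which the slot counts as timed out. */
    deadline = inactive_since + inactive_timeout;

    /* Round up to the next checkpoint boundary (a checkpoint exactly
     * at the deadline already detects the slot). */
    return ((deadline + checkpoint_interval - 1) / checkpoint_interval)
        * checkpoint_interval;
}

/*
 * Hypothetical helper: how long the slot stays valid past its timeout
 * before a checkpoint notices it.
 */
static int64_t
detection_delay(int64_t inactive_since,
                int64_t inactive_timeout,
                int64_t checkpoint_interval)
{
    return first_detecting_checkpoint(inactive_since, inactive_timeout,
                                      checkpoint_interval)
        - (inactive_since + inactive_timeout);
}
```

For example, with a 60s timeout and 300s between checkpoints, a slot
that becomes inactive at t=240s is caught right at the t=300s
checkpoint, whereas one that becomes inactive at t=241s is only
invalidated at t=600s, 299s after its timeout elapsed.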

Regards,
Vignesh

In response to

Re: Introduce XID age and inactive timeout based replication slot invalidation at 2024-12-05 01:13:40 from Peter Smith

Responses

Re: Introduce XID age and inactive timeout based replication slot invalidation at 2024-12-10 11:51:09 from Nisha Moond
