From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | Peter Smith <smithpb2250(at)gmail(dot)com> |
Cc: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date: | 2024-12-06 05:34:29 |
Message-ID: | CALDaNm3znxKvTv=MDmMBOSk6XurkKgYyh9cH49NLL7SmJ62Q_A@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, 5 Dec 2024 at 06:44, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Wed, Dec 4, 2024 at 9:27 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> ...
> >
> > Currently, replication slots are invalidated based on the
> > replication_slot_inactive_timeout only during a checkpoint. This means
> > that if the checkpoint_timeout is set to a higher value than the
> > replication_slot_inactive_timeout, slot invalidation will occur only
> > when the checkpoint is triggered. Identifying the invalidation slots
> > might be slightly delayed in this case. As an alternative, users can
> > forcefully invalidate inactive slots that have exceeded the
> > replication_slot_inactive_timeout by forcing a checkpoint. I was
> > thinking we could suggest this in the documentation.
> >
> > + <para>
> > + Slot invalidation due to inactive timeout occurs during checkpoint.
> > + The duration of slot inactivity is calculated using the slot's
> > + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
> > + value.
> > + </para>
> > +
> >
> > We could accurately invalidate the slots using the checkpointer
> > process by calculating the invalidation time based on the active_since
> > timestamp and the replication_slot_inactive_timeout, and then set the
> > checkpointer's main wait-latch accordingly for triggering the next
> > checkpoint. Ideally, a different process handling this task would be
> > better, but there is currently no dedicated daemon capable of
> > identifying and managing slots across streaming replication, logical
> > replication, and other slots used by plugins. Additionally,
> > overloading the checkpointer with this responsibility may not be
> > ideal. As an alternative, we could document about this delay in
> > identifying and mention that it could be triggered by forceful manual
> > checkpoint.
> >
>
> Hi Vignesh.
>
> I felt that manipulating the checkpoint timing behind the scenes
> without the user's consent might be a bit of an overreach.
Agreed.
> But there might still be something else we could do:
>
> 1. We can add the documentation note like you suggested ("we could
> document about this delay in identifying and mention that it could be
> triggered by forceful manual checkpoint").
Yes, that makes sense.
> 2. We can also detect such delays in the code. When the invalidation
> occurs (e.g. code fragment below) we could check if there was some
> excessive lag between the slot becoming idle and it being invalidated.
> If the lag is too much (whatever "too much" means) we can log a hint
> for the user to increase the checkpoint frequency (or whatever else we
> might advise them to do).
>
> + /*
> + * Check if the slot needs to be invalidated due to
> + * replication_slot_inactive_timeout GUC.
> + */
> + if (IsSlotInactiveTimeoutPossible(s) &&
> + TimestampDifferenceExceeds(s->inactive_since, now,
> + replication_slot_inactive_timeout_ms))
> + {
> + invalidation_cause = cause;
> + inactive_since = s->inactive_since;
>
> pseudo-code:
> if (slot invalidation occurred much later after the
> replication_slot_inactive_timeout GUC elapsed)
> {
> elog(LOG, "This slot was inactive for a period of %s. Slot timeout
> invalidation only occurs at a checkpoint so if you want inactive slots
> to be invalidated in a more timely manner consider reducing the time
> between checkpoints or executing a manual checkpoint.
> (replication_slot_inactive_timeout = %s; checkpoint_timeout = %s,
> ....)"
> }
>
> + }
Choosing the right checkpoint interval may be challenging for users,
because the delay depends on when the slot's inactive_since value is
set, when checkpoint_timeout elapses, and when the subsequent
checkpoint is actually triggered. Even if the user picks an appropriate
value, invalidation can still be identified late, depending on where
the slot's inactive_since falls relative to the checkpoint schedule.
Including this information in the documentation should be sufficient.
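
To make the timing dependency concrete, here is a minimal standalone
sketch. It is not PostgreSQL code: the function name
invalidation_delay(), the use of plain seconds instead of TimestampTz,
and the assumption that checkpoints fire at perfectly regular intervals
are all illustrative simplifications.

```c
#include <assert.h>

/*
 * Illustration only, not PostgreSQL code.
 *
 * All values are seconds on a common clock. Checkpoints are assumed to
 * run at first_checkpoint and then every checkpoint_timeout seconds.
 * Returns how long after the inactive timeout has elapsed the slot is
 * actually noticed (and invalidated) by a checkpoint.
 */
static long
invalidation_delay(long inactive_since, long inactive_timeout,
                   long first_checkpoint, long checkpoint_timeout)
{
    long eligible_at = inactive_since + inactive_timeout;
    long checkpoint_at = first_checkpoint;

    /* A non-positive interval would make the loop below never end. */
    assert(checkpoint_timeout > 0);

    /* Find the first checkpoint at or after the slot becomes eligible. */
    while (checkpoint_at < eligible_at)
        checkpoint_at += checkpoint_timeout;

    return checkpoint_at - eligible_at;
}
```

With a 60-second timeout and checkpoints every 300 seconds starting at
t=0, a slot that goes idle at t=10 waits 230 seconds past its timeout,
while one that goes idle at t=239 waits only 1 second. The same
settings give very different delays, which is why a fixed "too much
lag" threshold is hard to define.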
Regards,
Vignesh