Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-08-12 12:17:55
Message-ID: CAFPTHDargjqdW4DAh1n4j7ONW3H=KV7pOzy=EjHNhwERAFnRGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 24, 2024 at 4:01 PM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> Hi,
>
> On Mon, Jun 17, 2024 at 5:55 PM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >
> > Here are my thoughts on when to do the XID age invalidation. In all
> > the patches sent so far, the XID age invalidation happens in two
> > places - one during the slot acquisition, and another during the
> > checkpoint. As the suggestion is to do it during the vacuum (manual
> > and auto), so that even if the checkpoint isn't happening in the
> > database for whatever reasons, a vacuum command or autovacuum can
> > invalidate the slots whose XID is aged.
> >
> > An idea is to check for XID age based invalidation for all the slots
> > in ComputeXidHorizons() before it reads replication_slot_xmin and
> > replication_slot_catalog_xmin, and obviously before the proc array
> > lock is acquired. A potential problem with this approach is that the
> > invalidation check can become too aggressive as XID horizons are
> > computed from many places.
> >
> > Another idea is to check for XID age based invalidation for all the
> > slots in higher levels than ComputeXidHorizons(), for example in
> > vacuum() which is an entry point for both vacuum command and
> > autovacuum. This approach seems similar to vacuum_failsafe_age GUC
> > which checks each relation for the failsafe age before vacuum gets
> > triggered on it.
>
> I am attaching the patches implementing the idea of invalidating
> replication slots during vacuum when current slot xmin limits
> (procArray->replication_slot_xmin and
> procArray->replication_slot_catalog_xmin) are aged as per the new XID
> age GUC. When either of these limits are aged, there must be at least
> one replication slot that is aged, because the xmin limits, after all,
> are the minimum of xmin or catalog_xmin of all replication slots. In
> this approach, the new XID age GUC will help vacuum when needed,
> because the current slot xmin limits are recalculated after
> invalidating replication slots that are holding xmins for longer than
> the age. The code is placed in vacuum() which is common for both
> vacuum command and autovacuum, and gets executed only once every
> vacuum cycle to not be too aggressive in invalidating.
>
> However, there might be some concerns with this approach like the
> following:
> 1) Adding more code to vacuum might not be acceptable
> 2) What if invalidation of replication slots emits an error, will it
> block vacuum forever? Currently, InvalidateObsoleteReplicationSlots()
> is also called as part of the checkpoint, and emitting ERRORs from
> within is avoided already. Therefore, there is no concern here for
> now.
> 3) What if there are more replication slots to be invalidated, will it
> delay the vacuum? If yes, by how much? <<TODO>>
> 4) Will the invalidation based on just current replication slot xmin
> limits suffice irrespective of vacuum cutoffs? IOW, if the replication
> slots are invalidated but vacuum isn't going to do any work because
> vacuum cutoffs are not yet met? Is the invalidation work wasteful
> here?
> 5) Is it okay to take just one more time the proc array lock to get
> current replication slot xmin limits via
> ProcArrayGetReplicationSlotXmin() once every vacuum cycle? <<TODO>>
> 6) Vacuum command can't be run on the standby in recovery. So, to help
> invalidate replication slots on the standby, I have for now let the
> checkpointer also do the XID age based invalidation. I know
> invalidating both in checkpointer and vacuum may not be a great idea,
> but I'm open to thoughts.
>
> Following are some of the alternative approaches which IMHO don't help
> vacuum when needed:
> a) Let the checkpointer do the XID age based invalidation, and call it
> out in the documentation that if the checkpoint doesn't happen, the
> new GUC doesn't help even if the vacuum is run. This has been the
> approach until v40 patch.
> b) Checkpointer and/or other backends add an autovacuum work item via
> AutoVacuumRequestWork(), and autovacuum when it gets to it will
> invalidate the replication slots. But, what to do for the vacuum
> command here?
>
> Please find the attached v41 patches implementing the idea of vacuum
> doing the invalidation.
>
> Thoughts?
>
> Thanks to Sawada-san for a detailed off-list discussion.
>

The patch no longer applies on HEAD, please rebase.

regards,
Ajin Cherian
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2024-08-12 12:34:52 Re: libpq minor TOCTOU violation
Previous Message David Rowley 2024-08-12 11:59:31 Re: Typos in the code and README