Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-06-24 06:00:00
Message-ID: CALj2ACXQNGXokgx8APwdxrG4MHMF=cOz6XQtUL7EHua9oUfkgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, Jun 17, 2024 at 5:55 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Here are my thoughts on when to do the XID age invalidation. In all
> the patches sent so far, the XID age invalidation happens in two
> places - one during the slot acquisition, and another during the
> checkpoint. As the suggestion is to do it during the vacuum (manual
> and auto), so that even if the checkpoint isn't happening in the
> database for whatever reasons, a vacuum command or autovacuum can
> invalidate the slots whose XID is aged.
>
> An idea is to check for XID age based invalidation for all the slots
> in ComputeXidHorizons() before it reads replication_slot_xmin and
> replication_slot_catalog_xmin, and obviously before the proc array
> lock is acquired. A potential problem with this approach is that the
> invalidation check can become too aggressive as XID horizons are
> computed from many places.
>
> Another idea is to check for XID age based invalidation for all the
> slots in higher levels than ComputeXidHorizons(), for example in
> vacuum() which is an entry point for both vacuum command and
> autovacuum. This approach seems similar to vacuum_failsafe_age GUC
> which checks each relation for the failsafe age before vacuum gets
> triggered on it.

I am attaching the patches implementing the idea of invalidating
replication slots during vacuum when current slot xmin limits
(procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin) are aged as per the new XID
age GUC. When either of these limits are aged, there must be at least
one replication slot that is aged, because the xmin limits, after all,
are the minimum of xmin or catalog_xmin of all replication slots. In
this approach, the new XID age GUC will help vacuum when needed,
because the current slot xmin limits are recalculated after
invalidating replication slots that are holding xmins for longer than
the age. The code is placed in vacuum() which is common for both
vacuum command and autovacuum, and gets executed only once every
vacuum cycle to not be too aggressive in invalidating.

However, there might be some concerns with this approach like the following:
1) Adding more code to vacuum might not be acceptable
2) What if invalidation of replication slots emits an error, will it
block vacuum forever? Currently, InvalidateObsoleteReplicationSlots()
is also called as part of the checkpoint, and emitting ERRORs from
within is avoided already. Therefore, there is no concern here for
now.
3) What if there are more replication slots to be invalidated, will it
delay the vacuum? If yes, by how much? <<TODO>>
4) Will the invalidation based on just current replication slot xmin
limits suffice irrespective of vacuum cutoffs? IOW, if the replication
slots are invalidated but vacuum isn't going to do any work because
vacuum cutoffs are not yet met? Is the invalidation work wasteful
here?
5) Is it okay to take just one more time the proc array lock to get
current replication slot xmin limits via
ProcArrayGetReplicationSlotXmin() once every vacuum cycle? <<TODO>>
6) Vacuum command can't be run on the standby in recovery. So, to help
invalidate replication slots on the standby, I have for now let the
checkpointer also do the XID age based invalidation. I know
invalidating both in checkpointer and vacuum may not be a great idea,
but I'm open to thoughts.

Following are some of the alternative approaches which IMHO don't help
vacuum when needed:
a) Let the checkpointer do the XID age based invalidation, and call it
out in the documentation that if the checkpoint doesn't happen, the
new GUC doesn't help even if the vacuum is run. This has been the
approach until v40 patch.
b) Checkpointer and/or other backends add an autovacuum work item via
AutoVacuumRequestWork(), and autovacuum when it gets to it will
invalidate the replication slots. But, what to do for the vacuum
command here?

Please find the attached v41 patches implementing the idea of vacuum
doing the invalidation.

Thoughts?

Thanks to Sawada-san for a detailed off-list discussion.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v41-0001-Add-inactive_timeout-based-replication-slot-inva.patch application/x-patch 33.6 KB
v41-0002-Add-XID-age-based-replication-slot-invalidation.patch application/x-patch 32.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2024-06-24 06:29:53 Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan
Previous Message Li Japin 2024-06-24 05:27:54 Re: Support "Right Semi Join" plan shapes