Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2025-02-17 02:27:22
Message-ID: CAA4eK1+Oefb-dxBfi178YrW3wvmBZA2ymz5ctAGo=82pxG74Wg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 12, 2025 at 1:16 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Wednesday, February 12, 2025 11:56 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 11, 2025 at 9:39 PM Nathan Bossart
> > <nathandbossart(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Feb 11, 2025 at 03:22:49PM +0100, Álvaro Herrera wrote:
> > > > I find this proposed patch a bit strange and I feel it needs more
> > > > explanation.
> > > >
> > > > When this thread started, Bharath justified his patches saying that
> > > > a slot that's inactive for a very long time could be problematic
> > > > > because of XID wraparound. Fine, that sounds like a reasonable feature.
> > > > If you wanted to invalidate slots whose xmins were too old, I would
> > > > support that. He submitted that as his 0004 patch then.
> > > >
> > > > However, he also chose to submit 0003 with invalidation based on a
> > > > timeout. This is far less convincing a feature to me. The
> > > > > justification for the timeout seems to be that ... it's difficult
> > > > > to have a one-size-fits-all value because disk sizes vary. (???)
> > > > > Or something like that. Really? I mean -- yes, this will prevent
> > > > > problems in toy databases when run on developers' laptops. It will
> > > > not prevent any problems in production databases. Do we really want
> > > > a setting that is only useful for toy situations rather than production?
> > > >
> > > >
> > ...
> > > >
> > > > I'm baffled.
> > >
> > > I agree, and I am also baffled because I think this discussion has
> > > happened at least once already on this thread.
> > >
> >
> > Yes, we previously discussed this topic and Robert seems to prefer a
> > time-based parameter for invalidating the slot (1)(2) as it is easier to reason about in
> > terms of time. The other points discussed previously were that there are tools
> > that create a lot of slots and sometimes forget to clean up slots. Bharath has
> > seen this in production and we now have the tool pg_createsubscriber that
> > creates a slot per database, so if, for some reason, such slots are not cleaned up
> > on the tool's exit, such a parameter could save the cluster. See (3)(4).
> >
> > Also, we previously didn't have a good experience with XID-based threshold
> > parameters like vacuum_defer_cleanup_age as mentioned by Robert (1).
> > AFAICU from the previous discussion, we need a time-based parameter, and we
> > didn't rule out an xid_age-based parameter as an additional one.
>
> Yeah, I think the primary purpose of this time-based option is to invalidate dormant
> replication slots that have been inactive for a long period, in which case the
> slots are no longer useful.
>
> Such slots can remain if a subscriber is down due to a system error or
> inaccessible because of network issues. If this situation persists, it might be
> more practical to recreate the subscriber rather than attempt to recover the
> node and wait for it to catch up, which could be time-consuming.
>
> Parameters like max_slot_wal_keep_size and max_slot_xid_age do not
> differentiate between active and inactive replication slots. Some customers I
> met are hesitant about using these settings, as they can sometimes invalidate
> a slot unnecessarily and break the replication.
>

Alvaro, Nathan, please let us know if you would like to discuss the
use case for the new GUC idle_replication_slot_timeout further.
Otherwise, we can proceed with this patch.

--
With Regards,
Amit Kapila.
