Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2025-02-17 16:48:49
Message-ID: Z7NocSKY8FHa8zhT@nathan
Lists: pgsql-hackers

On Mon, Feb 17, 2025 at 07:57:22AM +0530, Amit Kapila wrote:
> On Wed, Feb 12, 2025 at 1:16 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>> On Wednesday, February 12, 2025 11:56 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> > Also, we previously didn't have a good experience with XID-based threshold
>> > parameters like vacuum_defer_cleanup_age as mentioned by Robert (1).
>> > AFAICU from the previous discussion we need a time-based parameter and we
>> > didn't rule out xid_age based parameter as another parameter.

I am not sure I buy the comparison with vacuum_defer_cleanup_age. That is
a very different feature than max_slot_xid_age, and we still have a number
of XID-based parameters (vacuum_freeze_table_age, vacuum_freeze_min_age,
vacuum_failsafe_age, the multixact versions of those parameters, and the
autovacuum versions).

>> Yeah, I think the primary purpose of this time-based option is to invalidate dormant
>> replication slots that have been inactive for a long period, in which case the
>> slots are no longer useful.
>>
>> Such slots can remain if a subscriber is down due to a system error or
>> inaccessible because of network issues. If this situation persists, it might be
>> more practical to recreate the subscriber rather than attempt to recover the
>> node and wait for it to catch up, which could be time-consuming.
>>
>> Parameters like max_slot_wal_keep_size and max_slot_xid_age do not
>> differentiate between active and inactive replication slots. Some customers I
>> met are hesitant about using these settings, as they can sometimes invalidate
>> a slot unnecessarily and break the replication.

Sure, an inactive-timeout feature won't break replication, but it's also
not going to be terribly effective against wraparound-related issues. It
seems weird to me to allow an active replication slot to take priority over
imminent storage/XID issues it causes.

> Alvaro, Nathan, do let us know if you would like to discuss more on
> the use case for this new GUC idle_replication_slot_timeout?
> Otherwise, we can proceed with this patch.

I guess I'm not mortally opposed to it. I just think we really need
proper backstops against the storage/XID issues more than we need this one,
and I don't want it to be mistaken for a solution to those problems.

--
nathan
