Re: Introduce XID age and inactive timeout based replication slot invalidation

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-03 09:31:06
Message-ID: CAJpy0uC8Dg-0JS3NRUwVUemgz5Ar2v3_EQQFXyAigWSEQ8U47Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2024 at 1:45 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Hi,
>
>
> Please find the attached v44 patch with the above changes. I will
> include the 0002 xid_age based invalidation patch later.
>

Thanks for the patch, Bharath. My review and testing are WIP, but please
find a few comments and queries:

1)
I see that ReplicationSlotAlter() will error out if the slot is
invalidated due to timeout. I have not tested it myself, but do you
know if slot-alter errors out for other invalidation causes as well?
Just wanted to confirm that the behaviour is consistent for all
invalidation causes.

2)
When a slot is invalidated, and we try to use that slot, it gives this msg:

ERROR: can no longer get changes from replication slot "mysubnew1_2"
DETAIL: The slot became invalid because it was inactive since
2024-09-03 14:23:34.094067+05:30, which is more than 600 seconds ago.
HINT: You might need to increase "replication_slot_inactive_timeout.".

Isn't the HINT misleading? Even if we increase it now, the slot cannot be
reused.

3)
When the slot is invalidated, 'inactive_since' still keeps on
changing when there is a subscriber trying to start replication
continuously. I think ReplicationSlotAcquire() keeps on failing and
thus Release keeps on setting it again and again. Shouldn't we stop
setting/changing 'inactive_since' once the slot is already
invalidated? Otherwise it will be misleading.

postgres=# select failover,synced,inactive_since,invalidation_reason
from pg_replication_slots;

 failover | synced |          inactive_since          | invalidation_reason
----------+--------+----------------------------------+---------------------
 t        | f      | 2024-09-03 14:23:..              | inactive_timeout

after some time:
 failover | synced |          inactive_since          | invalidation_reason
----------+--------+----------------------------------+---------------------
 t        | f      | 2024-09-03 14:26:..              | inactive_timeout
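
To illustrate what I mean in 3), here is a rough standalone sketch, not code from the patch. The struct, enum, and function names are simplified stand-ins that I made up for illustration (they only loosely mirror the slot fields), and where the real check belongs in the release path is for the patch to decide. The idea is only that the release path skips the 'inactive_since' update once the slot has an invalidation cause set:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real PostgreSQL definitions. */
typedef int64_t ModelTimestamp;

typedef enum
{
    MODEL_INVAL_NONE = 0,           /* slot is still valid */
    MODEL_INVAL_INACTIVE_TIMEOUT    /* slot was invalidated due to timeout */
} ModelInvalidationCause;

typedef struct
{
    ModelInvalidationCause invalidated;
    ModelTimestamp inactive_since;
} ModelSlot;

/*
 * Model of the "release" step: record the time the slot became inactive,
 * but only if the slot has not been invalidated. Returns true if
 * inactive_since was updated, false if it was left untouched.
 */
static bool
model_release_update_inactive_since(ModelSlot *slot, ModelTimestamp now)
{
    if (slot->invalidated != MODEL_INVAL_NONE)
        return false;           /* keep the time at which it went inactive */

    slot->inactive_since = now;
    return true;
}
```

With something like this, repeated failed acquire/release cycles by the subscriber would leave 'inactive_since' at the value shown in the first query output above.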

4)
doc/src/sgml/config.sgml:

4a)
+ A value of zero (which is default) disables the timeout mechanism.

It would be better as:
A value of zero (which is default) disables the inactive timeout
invalidation mechanism.
or
A value of zero (which is default) disables the slot invalidation due
to the inactive timeout mechanism.

i.e. rephrase to indicate that invalidation is disabled.

4b)
'synced' and 'inactive_since' should link to pg_replication_slots:

example:
<link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>synced</structfield>

5)
doc/src/sgml/system-views.sgml:
+ ..the slot has been inactive for longer than the duration specified
by replication_slot_inactive_timeout parameter.

Better to have:
..the slot has been inactive for a time longer than the duration
specified by the replication_slot_inactive_timeout parameter.

thanks
Shveta
