Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-08 11:55:42
Message-ID: CALj2ACWqLneo2kqP5DEUwDXyHoM5zPn2vE+27oB+Cse9Rvzo_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for reviewing.

On Tue, Sep 3, 2024 at 3:01 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> 1)
> I see that ReplicationSlotAlter() will error out if the slot is
> invalidated due to timeout. I have not tested it myself, but do you
> know if slot-alter errors out for other invalidation causes as well?
> Just wanted to confirm that the behaviour is consistent for all
> invalidation causes.

Will respond to Amit's comment soon.

> 2)
> When a slot is invalidated, and we try to use that slot, it gives this msg:
>
> ERROR: can no longer get changes from replication slot "mysubnew1_2"
> DETAIL: The slot became invalid because it was inactive since
> 2024-09-03 14:23:34.094067+05:30, which is more than 600 seconds ago.
> HINT: You might need to increase "replication_slot_inactive_timeout.".
>
> Isn't HINT misleading? Even if we increase it now, the slot can not be
> reused again.
>
> Below is one side effect if inactive_since keeps on changing:
>
> postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
> pg_current_wal_lsn());
> ERROR: can no longer get changes from replication slot "mysubnew1_1"
> DETAIL: The slot became invalid because it was inactive since
> 2024-09-04 10:03:56.68053+05:30, which is more than 10 seconds ago.
> HINT: You might need to increase "replication_slot_inactive_timeout.".
>
> postgres=# select now();
> now
> ---------------------------------
> 2024-09-04 10:04:00.26564+05:30
>
> 'DETAIL' gives wrong information, we are not past 10-seconds. This is
> because inactive_since got updated even in ERROR scenario.
>
> ERROR: can no longer get changes from replication slot "mysubnew1_1"
> DETAIL: The slot became invalid because it was inactive since
> 2024-09-04 10:06:38.980939+05:30, which is more than 129600 seconds
> ago.
> postgres=# select now();
> now
> ----------------------------------
> 2024-09-04 10:07:35.201894+05:30
>
> I feel we should change this message itself.

Removed the hint and corrected the detail message as following:

errmsg("can no longer get changes from replication slot \"%s\"",
NameStr(s->data.name)),
errdetail("This slot has been invalidated because it was inactive for
longer than the amount of time specified by \"%s\".",
"replication_slot_inactive_timeout.")));

> 3)
> When the slot is invalidated, the' inactive_since' still keeps on
> changing when there is a subscriber trying to start replication
> continuously. I think ReplicationSlotAcquire() keeps on failing and
> thus Release keeps on setting it again and again. Shouldn't we stop
> setting/chnaging 'inactive_since' once the slot is invalidated
> already, otherwise it will be misleading.
>
> postgres=# select failover,synced,inactive_since,invalidation_reason
> from pg_replication_slots;
>
> failover | synced | inactive_since | invalidation_reason
> ----------+--------+----------------------------------+---------------------
> t | f | 2024-09-03 14:23:.. | inactive_timeout
>
> after sometime:
> failover | synced | inactive_since | invalidation_reason
> ----------+--------+----------------------------------+---------------------
> t | f | 2024-09-03 14:26:..| inactive_timeout

Changed it to not update inactive_since for slots invalidated due to
inactive timeout.

> 4)
> src/sgml/config.sgml:
>
> 4a)
> + A value of zero (which is default) disables the timeout mechanism.
>
> Better will be:
> A value of zero (which is default) disables the inactive timeout
> invalidation mechanism .

Changed.

> 4b)
> 'synced' and inactive_since should point to pg_replication_slots:
>
> example:
> <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>synced</structfield>

Modified.

> 5)
> src/sgml/system-views.sgml:
> + ..the slot has been inactive for longer than the duration specified
> by replication_slot_inactive_timeout parameter.
>
> Better to have:
> ..the slot has been inactive for a time longer than the duration
> specified by the replication_slot_inactive_timeout parameter.

Changed it to the following to be consistent with the config.sgml.

<literal>inactive_timeout</literal> means that the slot has been
inactive for longer than the amount of time specified by the
<xref linkend="guc-replication-slot-inactive-timeout"/> parameter.

Please find the v45 patch posted upthread at
https://www.postgresql.org/message-id/CALj2ACWXQT3_HY40ceqKf1DadjLQP6b1r%3D0sZRh-xhAOd-b0pA%40mail.gmail.com
for the changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Stepan Neretin 2024-09-08 12:59:52 Re: Sort functions with specialized comparators
Previous Message Bharath Rupireddy 2024-09-08 11:54:47 Re: Introduce XID age and inactive timeout based replication slot invalidation