Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-27 04:38:33
Message-ID: CALj2ACX-M+Lx6Qv4id7HtqqTKVKA_ti4whK9cgpTPHdHLi943w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> > I'm attaching v24 patches. It implements the above idea proposed
> > upthread for synced slots.
>
> ==== v24-0002
>
> 1 ===
>
> This commit does two things:
> 1) Updates inactive_since for sync slots with the value
> received from the primary's slot.
>
> Tested it and it does that.

Thanks. I've added a test case for this.

> 2 ===
>
> 2) Ensures the value is set to current timestamp during the
> shutdown of slot sync machinery to help correctly interpret the
> time if the standby gets promoted without a restart.
>
> Tested it and it does that.

Thanks. I've added a test case for this.

> 3 ===
>
> +/*
> + * Reset the synced slots info such as inactive_since after shutting
> + * down the slot sync machinery.
> + */
> +static void
> +update_synced_slots_inactive_time(void)
>
> Looks like the comment "reset" is not matching the name of the function and
> what it does.

Changed. I've also changed the function name to
update_synced_slots_inactive_since to be precise on what it exactly
does.

> 4 ===
>
> + /*
> + * We get the current time beforehand and only once to avoid
> + * system calls overhead while holding the lock.
> + */
> + if (now == 0)
> + now = GetCurrentTimestamp();
>
> Also +1 of having GetCurrentTimestamp() just called one time within the loop.

Right.

> 5 ===
>
> - if (!(RecoveryInProgress() && slot->data.synced))
> + if (!(InRecovery && slot->data.synced))
> slot->inactive_since = GetCurrentTimestamp();
> else
> slot->inactive_since = 0;
>
> Not related to this change but more the way RestoreSlotFromDisk() behaves here:
>
> For a sync slot on standby it will be set to zero and then later will be
> synchronized with the one coming from the primary. I think that's fine to have
> it to zero for this window of time.

Right.

> Now, if the standby is down and one sets sync_replication_slots to off,
> then inactive_since will be set to zero on the standby at startup and not
> synchronized (unless one triggers a manual sync). I also think that's fine but
> it might be worth to document this behavior (that after a standby startup
> inactive_since is zero until the next sync...).

Isn't this behaviour applicable for other slot parameters that the
slot syncs from the remote slot on the primary?

I've added the following note in the comments when we update
inactive_since in RestoreSlotFromDisk.

* Note that for synced slots after the standby starts up (i.e. after
* the slots are loaded from the disk), the inactive_since will remain
* zero until the next slot sync cycle.
*/
if (!(InRecovery && slot->data.synced))
slot->inactive_since = GetCurrentTimestamp();
else
slot->inactive_since = 0;

> 6 ===
>
> + print "HI $slot_name $name $inactive_since $slot_creation_time\n";
>
> garbage?

Removed.

> 7 ===
>
> +# Capture and validate inactive_since of a given slot.
> +sub capture_and_validate_slot_inactive_since
> +{
> + my ($node, $slot_name, $slot_creation_time) = @_;
> + my $name = $node->name;
>
> We know have capture_and_validate_slot_inactive_since at 2 places:
> 040_standby_failover_slots_sync.pl and 019_replslot_limit.pl.
>
> Worth to create a sub in Cluster.pm?

I'd second that thought for now. We might have to debate first if it's
useful for all the nodes even without replication, and if yes, the
naming stuff and all that. Historically, we've had such duplicated
functions until recently, for instance advance_wal and log_contains.
We
moved them over to a common perl library Cluster.pm very recently. I'm
sure we can come back later to move it to Cluster.pm.

On Wed, Mar 27, 2024 at 9:02 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> 1)
> slot.c:
> + * data from the remote slot. We use InRecovery flag instead of
> + * RecoveryInProgress() as it always returns true even for normal
> + * server startup.
>
> a) Not clear what 'it' refers to. Better to use 'the latter'
> b) Is it better to mention the primary here:
> 'as the latter always returns true even on the primary server during startup'.

Modified.

> 2)
> update_local_synced_slot():
>
> - strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0)
> + strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0 &&
> + remote_slot->inactive_since == slot->inactive_since)
>
> When this code was written initially, the intent was to do strcmp at
> the end (only if absolutely needed). It will be good if we maintain
> the same and add new checks before strcmp.

Done.

> 3)
> update_synced_slots_inactive_time():
>
> This assert is removed, is it intentional?
> Assert(s->active_pid == 0);

Yes, the slot can get acquired in the corner case when someone runs
pg_sync_replication_slots concurrently at this time. I'm referring to
the issue reported upthread. We don't prevent one running
pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
Maybe we should prevent that otherwise some of the slots are synced
and the standby gets promoted while others are yet-to-be-synced.

> 4)
> 040_standby_failover_slots_sync.pl:
>
> +# Capture the inactive_since of the slot from the standby the logical failover
> +# slots are synced/created on the standby.
>
> The comment is unclear, something seems missing.

Nice catch. Yes, that was wrong. I've modified it now.

Please find the attached v25-0001 (made this 0001 patch now as
inactive_since patch is committed) patch with the above changes.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v25-0001-Maintain-inactive_since-for-synced-slots-correct.patch application/x-patch 13.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-03-27 04:40:05 Re: pgsql: Track last_inactive_time in pg_replication_slots.
Previous Message Amit Langote 2024-03-27 04:34:50 Re: remaining sql/json patches