Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-04-03 16:27:45
Message-ID: Zg2DgfXorAXrC8lw@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Apr 03, 2024 at 08:28:04PM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > Just one comment on v32-0001:
> >
> > +# Synced slot on the standby must get its own inactive_since.
> > +is( $standby1->safe_psql(
> > + 'postgres',
> > + "SELECT '$inactive_since_on_primary'::timestamptz <= '$inactive_since_on_standby'::timestamptz AND
> > + '$inactive_since_on_standby'::timestamptz <= '$slot_sync_time'::timestamptz;"
> > + ),
> > + "t",
> > + 'synchronized slot has got its own inactive_since');
> > +
> >
> > By using <= we are not testing that it must get its own inactive_since (as we
> > allow them to be equal in the test). I think we should just add some usleep()
> > where appropriate and deny equality during the tests on inactive_since.
>
> > Except for the above, v32-0001 LGTM.
>
> Thanks. Please see the attached v33-0001 patch after removing equality
> on inactive_since TAP tests.

Thanks! v33-0001 LGTM.

> On Wed, Apr 3, 2024 at 1:47 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > Some comments regarding v31-0002:
> >
> > T2 ===
> >
> > In case the slot is invalidated on the primary,
> >
> > primary:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
> > slot_name | inactive_since | invalidation_reason
> > -----------+-------------------------------+---------------------
> > s1 | 2024-04-03 06:56:28.075637+00 | inactive_timeout
> >
> > then on the standby we get:
> >
> > standby:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
> > slot_name | inactive_since | invalidation_reason
> > -----------+------------------------------+---------------------
> > s1 | 2024-04-03 07:06:43.37486+00 | inactive_timeout
> >
> > shouldn't the slot be dropped/recreated instead of updating inactive_since?
>
> The sync slots that are invalidated on the primary aren't dropped and
> recreated on the standby.

Yeah, right (I was confused with synced slots that are invalidated locally).

> However, I
> found that the synced slot is acquired and released unnecessarily
> after the invalidation_reason is synced from the primary. I added a
> skip check in synchronize_one_slot to skip acquiring and releasing the
> slot if it's locally found inactive. With this, inactive_since won't
> get updated for invalidated sync slots on the standby as we don't
> acquire and release the slot.

CR1 ===

Yeah, I can see:

@@ -575,6 +575,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
" name slot \"%s\" already exists on the standby",
remote_slot->name));

+ /*
+ * Skip the sync if the local slot is already invalidated. We do this
+ * beforehand to save on slot acquire and release.
+ */
+ if (slot->data.invalidated != RS_INVAL_NONE)
+ return false;

Thanks to the drop_local_obsolete_slots() call I think we are not missing the case
where the slot has been invalidated on the primary, invalidation reason has been
synced on the standby and later the slot is dropped/ recreated manually on the
primary (then it should be dropped/recreated on the standby too).

Also it seems we are not missing the case where a sync slot is invalidated
locally due to wal removal (it should be dropped/recreated).

>
> > CR5 ===
> >
> > + /*
> > + * This function isn't expected to be called for inactive timeout based
> > + * invalidation. A separate function InvalidateInactiveReplicationSlot is
> > + * to be used for that.
> >
> > Do you think it's worth to explain why?
>
> Hm, I just wanted to point out the actual function here. I modified it
> to something like the following, if others feel we don't need that, I
> can remove it.

Sorry If I was not clear but I meant to say "Do you think it's worth to explain
why we decided to create a dedicated function"? (currently we "just" explain that
we created one).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-04-03 16:31:29 Re: [PATCH] Modify pg_ctl to detect presence of geek user
Previous Message Alvaro Herrera 2024-04-03 16:01:28 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock