From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
---|---|
To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date: | 2024-04-03 08:17:05 |
Message-ID: | Zg0QgarM9UlRutLw@ip-10-97-1-34.eu-west-3.compute.internal |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
>
> Please find the attached v31 patches implementing the above idea:
Thanks!
Some comments regarding v31-0002:
=== testing the behavior
T1 ===
> - synced slots don't get invalidated due to inactive timeout because
> such slots not considered active at all as they don't perform logical
> decoding (of course, they will perform in fast_forward mode to fix the
> other data loss issue, but they don't generate changes for them to be
> called as *active* slots)
It behaves as described. OTOH non synced logical slots on the standby and
physical slots on the standby are invalidated which is what is expected.
T2 ===
In case the slot is invalidated on the primary,
primary:
postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
slot_name | inactive_since | invalidation_reason
-----------+-------------------------------+---------------------
s1 | 2024-04-03 06:56:28.075637+00 | inactive_timeout
then on the standby we get:
standby:
postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
slot_name | inactive_since | invalidation_reason
-----------+------------------------------+---------------------
s1 | 2024-04-03 07:06:43.37486+00 | inactive_timeout
shouldn't the slot be dropped/recreated instead of updating inactive_since?
=== code
CR1 ===
+ Invalidates replication slots that are inactive for longer the
+ specified amount of time
s/for longer the/for longer that/?
CR2 ===
+ <literal>true</literal>) as such synced slots don't actually perform
+ logical decoding.
We're switching in fast forward logical due to [1], so I'm not sure that's 100%
accurate here. I'm not sure we need to specify a reason.
CR3 ===
+ errdetail("This slot has been invalidated because it was inactive for more than the time specified by replication_slot_inactive_timeout parameter.")));
I think we can remove "parameter" (see for example the error message in
validate_remote_info()) and reduce it a bit, something like?
"This slot has been invalidated because it was inactive for more than replication_slot_inactive_timeout"?
CR4 ===
+ appendStringInfoString(&err_detail, _("The slot has been inactive for more than the time specified by replication_slot_inactive_timeout parameter."));
Same.
CR5 ===
+ /*
+ * This function isn't expected to be called for inactive timeout based
+ * invalidation. A separate function InvalidateInactiveReplicationSlot is
+ * to be used for that.
Do you think it's worth to explain why?
CR6 ===
+ if (replication_slot_inactive_timeout == 0)
+ return false;
+ else if (slot->inactive_since > 0)
"else" is not needed here.
CR7 ===
+ SpinLockAcquire(&slot->mutex);
+
+ /*
+ * Check if the slot needs to be invalidated due to
+ * replication_slot_inactive_timeout GUC. We do this with the spinlock
+ * held to avoid race conditions -- for example the inactive_since
+ * could change, or the slot could be dropped.
+ */
+ now = GetCurrentTimestamp();
We should not call GetCurrentTimestamp() while holding a spinlock.
CR8 ===
+# Testcase start: Invalidate streaming standby's slot as well as logical
+# failover slot on primary due to inactive timeout GUC. Also, check the logical
s/inactive timeout GUC/replication_slot_inactive_timeout/?
CR9 ===
+# Start: Helper functions used for this test file
+# End: Helper functions used for this test file
I think that's the first TAP test with this comment. Not saying we should not but
why did you feel the need to add those?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Gustafsson | 2024-04-03 08:33:10 | Re: RFC: Additional Directory for Extensions |
Previous Message | Daniel Gustafsson | 2024-04-03 07:58:29 | Re: [HACKERS] make async slave to wait for lsn to be replayed |