From: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date: | 2024-03-31 04:55:46 |
Message-ID: | CALj2ACVVb1Uds+ucw_nq03uoGB5NToVi5hSiwW+S_goju_sncg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 28, 2024 at 3:13 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Regarding 0002:
Thanks for reviewing it.
> Some testing:
>
> T1 ===
>
> When the slot is invalidated on the primary, then the reason is propagated to
> the sync slot (if any). That's fine but we are loosing the inactive_since on the
> standby:
>
> Primary:
>
> postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
> slot_name | inactive_since | conflicting | invalidation_reason
> -------------+-------------------------------+-------------+---------------------
> lsub29_slot | 2024-03-28 08:24:51.672528+00 | f | inactive_timeout
> (1 row)
>
> Standby:
>
> postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
> slot_name | inactive_since | conflicting | invalidation_reason
> -------------+----------------+-------------+---------------------
> lsub29_slot | | f | inactive_timeout
> (1 row)
>
> I think in this case it should always reflect the value from the primary (so
> that one can understand why it is invalidated).
I'll come back to this as soon as we all agree on inactive_since
behavior for synced slots.
> T2 ===
>
> And it is set to a value during promotion:
>
> postgres=# select pg_promote();
> pg_promote
> ------------
> t
> (1 row)
>
> postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot';
> slot_name | inactive_since | conflicting | invalidation_reason
> -------------+------------------------------+-------------+---------------------
> lsub29_slot | 2024-03-28 08:30:11.74505+00 | f | inactive_timeout
> (1 row)
>
> I think when it is invalidated it should always reflect the value from the
> primary (so that one can understand why it is invalidated).
I'll come back to this as soon as we all agree on inactive_since
behavior for synced slots.
> T3 ===
>
> As far the slot invalidation on the primary:
>
> postgres=# SELECT * FROM pg_logical_slot_get_changes('lsub29_slot', NULL, NULL, 'include-xids', '0');
> ERROR: cannot acquire invalidated replication slot "lsub29_slot"
>
> Can we make the message more consistent with what can be found in CreateDecodingContext()
> for example?
Hm, that makes sense because slot acquisition and release is something
internal to the server.
> T4 ===
>
> Also, it looks like querying pg_replication_slots() does not trigger an
> invalidation: I think it should if the slot is not invalidated yet (and matches
> the invalidation criteria).
There's a different opinion on this, check comment #3 from
https://www.postgresql.org/message-id/CAA4eK1LLj%2BeaMN-K8oeOjfG%2BUuzTY%3DL5PXbcMJURZbFm%2B_aJSA%40mail.gmail.com.
> Code review:
>
> CR1 ===
>
> + Invalidate replication slots that are inactive for longer than this
> + amount of time. If this value is specified without units, it is taken
>
> s/Invalidate/Invalidates/?
Done.
> Should we mention the relationship with inactive_since?
Done.
> CR2 ===
>
> + *
> + * If check_for_invalidation is true, the slot is checked for invalidation
> + * based on replication_slot_inactive_timeout GUC and an error is raised after making the slot ours.
> */
> void
> -ReplicationSlotAcquire(const char *name, bool nowait)
> +ReplicationSlotAcquire(const char *name, bool nowait,
> + bool check_for_invalidation)
>
>
> s/check_for_invalidation/check_for_timeout_invalidation/?
Done.
> CR3 ===
>
> + if (slot->inactive_since == 0 ||
> + replication_slot_inactive_timeout == 0)
> + return false;
>
> Better to test replication_slot_inactive_timeout first? (I mean there is no
> point of testing inactive_since if replication_slot_inactive_timeout == 0)
>
> CR4 ===
>
> + if (slot->inactive_since > 0 &&
> + replication_slot_inactive_timeout > 0)
> + {
>
> Same.
>
> So, instead of CR3 === and CR4 ===, I wonder if it wouldn't be better to do
> something like:
>
> if (replication_slot_inactive_timeout == 0)
> return false;
> else if (slot->inactive_since > 0)
> .
> else
> return false;
>
> That would avoid checking replication_slot_inactive_timeout and inactive_since
> multiple times.
Done.
> CR5 ===
>
> + * held to avoid race conditions -- for example the restart_lsn could move
> + * forward, or the slot could be dropped.
>
> Does the restart_lsn example makes sense here?
No, it doesn't. Modified that.
> CR6 ===
>
> +static bool
> +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks)
> +{
>
> InvalidatePossiblyInactiveSlot() maybe?
I think we will lose the essence i.e. timeout from the suggested
function name, otherwise just the inactive doesn't give a clearer
meaning. I kept it that way unless anyone suggests otherwise.
> CR7 ===
>
> + /* Make sure the invalidated state persists across server restart */
> + slot->just_dirtied = true;
> + slot->dirty = true;
> + SpinLockRelease(&slot->mutex);
>
> Maybe we could create a new function say MarkGivenReplicationSlotDirty()
> with a slot as parameter, that ReplicationSlotMarkDirty could call too?
Done that.
> Then maybe we could set slot->data.invalidated = RS_INVAL_INACTIVE_TIMEOUT in
> InvalidateSlotForInactiveTimeout()? (to avoid multiple SpinLockAcquire/SpinLockRelease).
Done that.
> CR8 ===
>
> + if (persist_state)
> + {
> + char path[MAXPGPATH];
> +
> + sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
> + SaveSlotToPath(slot, path, ERROR);
> + }
>
> Maybe we could create a new function say GivenReplicationSlotSave()
> with a slot as parameter, that ReplicationSlotSave() could call too?
Done that.
> CR9 ===
>
> + if (check_for_invalidation)
> + {
> + /* The slot is ours by now */
> + Assert(s->active_pid == MyProcPid);
> +
> + /*
> + * Well, the slot is not yet ours really unless we check for the
> + * invalidation below.
> + */
> + s->active_pid = 0;
> + if (InvalidateReplicationSlotForInactiveTimeout(s, true, true))
> + {
> + /*
> + * If the slot has been invalidated, recalculate the resource
> + * limits.
> + */
> + ReplicationSlotsComputeRequiredXmin(false);
> + ReplicationSlotsComputeRequiredLSN();
> +
> + /* Might need it for slot clean up on error, so restore it */
> + s->active_pid = MyProcPid;
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot acquire invalidated replication slot \"%s\"",
> + NameStr(MyReplicationSlot->data.name))));
> + }
> + s->active_pid = MyProcPid;
>
> Are we not missing some SpinLockAcquire/Release on the slot's mutex here? (the
> places where we set the active_pid).
Hm, yes. But, shall I acquire the mutex, set active_pid to 0 for a
moment just to satisfy Assert(slot->active_pid == 0); in
InvalidateReplicationSlotForInactiveTimeout and
InvalidateSlotForInactiveTimeout? I just removed the assertions
because being replication_slot_inactive_timeout > 0 and inactive_since
> 0 is enough for these functions to think and decide on inactive
timeout invalidation.
> CR10 ===
>
> @@ -1628,6 +1674,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> if (SlotIsLogical(s))
> invalidation_cause = cause;
> break;
> + case RS_INVAL_INACTIVE_TIMEOUT:
> + if (InvalidateReplicationSlotForInactiveTimeout(s, false, false))
> + invalidation_cause = cause;
> + break;
>
> InvalidatePossiblyObsoleteSlot() is not called with such a reason, better to use
> an Assert here and in the caller too?
Done.
> CR11 ===
>
> +++ b/src/test/recovery/t/050_invalidate_slots.pl
>
> why not using 019_replslot_limit.pl?
I understand that 019_replslot_limit covers wal_removed related
invalidations. But, I don't want to kludge it with a bunch of other
tests. The new tests anyway need a bunch of new nodes and a couple of
helper functions. Any future invalidation mechanisms can be added here
in this new file. Also, having a separate file quickly helps isolate
any test failures that BF animals might report in future. I don't
think a separate test file here hurts anyone unless there's a strong
reason against it.
Please see the attached v30 patch. 0002 is where all of the above
review comments have been addressed.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v30-0001-Maintain-inactive_since-for-synced-slots-correct.patch | application/x-patch | 16.5 KB |
v30-0002-Add-inactive_timeout-based-replication-slot-inva.patch | application/x-patch | 31.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2024-03-31 05:03:10 | Re: Query execution in Perl TAP tests needs work |
Previous Message | Thomas Munro | 2024-03-31 04:46:10 | Re: pg_combinebackup --copy-file-range |