Re: promotion related handling in pg_sync_replication_slots()

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Subject: Re: promotion related handling in pg_sync_replication_slots()
Date: 2024-04-06 06:55:26
Message-ID: CALj2ACW4Tpdh4nsn7RaTB8C2NufVnNZtGYqPNwcbdaH4Ce=Yqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> Please find v2. Changes are:

Thanks for the patch. Here are some comments.

1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"

typedef enum SlotSyncSource
{
SLOT_SYNC_NONE,
SLOT_SYNC_WORKER,
SLOT_SYNC_BACKEND,
} SlotSyncSource;

Then, the check in ShutDownSlotSync can be:

+ /*
+ * Return if neither the slot sync worker is running nor the function
+ * pg_sync_replication_slots() is executing.
+ */
+ if ((SlotSyncCtx->pid == InvalidPid) &&
SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
{

2.
SyncReplicationSlots(WalReceiverConn *wrconn)
{
+ /*
+ * Startup process signaled the slot sync to stop, so if meanwhile user
+ * has invoked slot sync SQL function, simply return.
+ */
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ if (SlotSyncCtx->stopSignaled)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as slot sync
shutdown is signaled during promotion"));
+

Unless I'm missing something, I think this can't detect if the backend
via SQL function is already half-way through syncing in
synchronize_one_slot. So, better move this check to (or also have it
there) slot sync loop that calls synchronize_one_slot. To avoid
spinlock acquisitions, we can perhaps do this check in when we acquire
the spinlock for synced flag.

/* Search for the named slot */
if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
{
bool synced;

SpinLockAcquire(&slot->mutex);
synced = slot->data.synced;
<< get SlotSyncCtx->stopSignaled here >>
SpinLockRelease(&slot->mutex);

<< do the slot sync skip check here if (stopSignaled) >>

3. Can we have a test or steps at least to check the consequences
manually one can get if slot syncing via SQL function is happening
during the promotion?

IIUC, we need to ensure there is no backend acquiring it and
performing sync while the slot sync worker is shutting down/standby
promotion is occuring. Otherwise, some of the slots can get resynced
and some are not while we are shutting down the slot sync worker as
part of the standby promotion which might leave the slots in an
inconsistent state.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-04-06 06:55:32 Re: remaining sql/json patches
Previous Message Amit Kapila 2024-04-06 06:48:34 Re: Introduce XID age and inactive timeout based replication slot invalidation