Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-04 05:10:33
Message-ID: CAJpy0uCBHRX-GSKCeVza44kFEC=uTMD_6uzuXXnbUX32Vt-g8Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> Review for v41 patch.

Thanks for the feedback.

>
> 1.
> ======
> src/backend/utils/misc/postgresql.conf.sample
>
> +#enable_syncslot = on # enables slot synchronization on the physical
> standby from the primary
>
> enable_syncslot is disabled by default, so, it should be 'off' here.
>

Sure, I will change it.

> ~~~
> 2.
> IIUC, the slotsyncworker's connection to the primary is to execute a
> query. Its aim is not walsender type connection, but at primary when
> queried, the 'backend_type' is set to 'walsender'.
> Snippet from primary db-
>
> datname | usename | application_name | wait_event_type | backend_type
> ---------+-------------+------------------+-----------------+--------------
> postgres | replication | slotsyncworker | Client | walsender
>
> Is it okay?
>

The slot-sync worker connects using 'libpqrcv_connect', which sends
the 'replication'='database' key-value pair as one of the connection
options. On the primary side, 'ProcessStartupPacket' sees this
key-value pair and marks the process as a walsender (am_walsender =
true), which is why it shows up as backend_type='walsender' in
pg_stat_activity. I do not see any harm in this backend_type for the
slot-sync worker currently. It is in line with the connections used
for logical replication. And since the slot-sync worker also deals
with WAL positions (LSNs), it is okay to keep backend_type as
walsender unless you (or others) see any potential issue in doing
that. So let me know.
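
To illustrate the flow described above, here is a minimal standalone
sketch of how the value of the 'replication' startup option decides
the walsender flags. It is modeled on the branch in
'ProcessStartupPacket' but is not the backend source: the struct
StartupFlags and the helpers parse_bool_simple() and
apply_replication_option() are hypothetical names used only for this
example, and the boolean parsing is deliberately reduced to a few
spellings.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins for the backend flags of the same names. */
typedef struct StartupFlags
{
    bool am_walsender;      /* shown as backend_type = 'walsender' */
    bool am_db_walsender;   /* walsender that is also bound to a database */
} StartupFlags;

/* Hypothetical helper: accepts only a few boolean spellings. */
bool
parse_bool_simple(const char *value, bool *result)
{
    if (strcmp(value, "true") == 0 || strcmp(value, "on") == 0 ||
        strcmp(value, "1") == 0)
    {
        *result = true;
        return true;
    }
    if (strcmp(value, "false") == 0 || strcmp(value, "off") == 0 ||
        strcmp(value, "0") == 0)
    {
        *result = false;
        return true;
    }
    return false;               /* not a recognized boolean */
}

/*
 * Apply the value of the 'replication' startup option.
 * Returns false when the value is neither "database" nor a boolean.
 */
bool
apply_replication_option(const char *value, StartupFlags *flags)
{
    if (strcmp(value, "database") == 0)
    {
        /* The case used by the slot-sync worker's connection. */
        flags->am_walsender = true;
        flags->am_db_walsender = true;
        return true;
    }
    /* Otherwise the value is treated as a plain boolean. */
    return parse_bool_simple(value, &flags->am_walsender);
}
```

With replication=database both flags end up true, so the connection
is reported as a walsender even if it only ever runs queries.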

> ~~~
> 3.
> As per current logic, If there are slots on primary with disabled
> subscriptions, then, when standby is created it replicates these slots
> but can't make them sync-ready until any activity happens on the
> slots.
> So, such slots stay in 'i' sync-state and get dropped when failover
> happens. Now, if the subscriber tries to enable their existing
> subscription after failover, it gives an error that the slot does not
> exist.
>

Yes, this is expected, as Amit explained in [1]. But let me review
whether we need to document this case for disabled subscriptions,
i.e. a disabled subscription, if enabled after promotion, might not
work.

> ~~~
> 4. primary_slot_name GUC value test:
>
> When standby is started with a non-existing primary_slot_name, the
> wal-receiver gives an error but the slot-sync worker does not raise
> any error/warning. It is no-op though as it has a check 'if
> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this
> okay or shall the slot-sync worker too raise an error and exit?
>
> In another case, when standby is started with valid primary_slot_name,
> but it is changed to some invalid value in runtime, then walreceiver
> starts giving error but the slot-sync worker keeps on running. In this
> case, unlike the previous case, it even did not go to no-op mode (as
> it sees valid WalRcv->latestWalEnd from the earlier run) and keep
> pinging primary repeatedly for slots. Shall here it should error out
> or at least be no-op until we give a valid primary_slot_name?
>

I reviewed it. There is no way to test the existence/validity of
'primary_slot_name' on the standby without making a connection to the
primary. If primary_slot_name is invalid from the start, the slot-sync
worker will be a no-op (as you tested) because WalRcv->latestWalEnd
will be invalid. If 'primary_slot_name' is changed to an invalid value
at runtime, the slot-sync worker will still keep on pinging the
primary. But that should be okay (in fact needed), as it needs to sync
at least the previous slots' positions (in case it was delayed in
doing so for some reason earlier). And once the slots are up-to-date
on the standby, even if the worker pings the primary, it will not see
any change in the slots' LSNs and will thus take a longer nap. I think
it is not worth the effort to introduce the complexity of checking the
validity of 'primary_slot_name' on the primary from the standby for
this rare scenario.
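
The behaviour described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not code from
the patch: SlotPos, sync_cycle() and the two nap constants are
hypothetical names and values, and XLogRecPtr is redefined locally so
the example compiles on its own. Returning the long nap in the no-op
case is also an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles outside the backend. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr       ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r)  ((r) == InvalidXLogRecPtr)

/* Hypothetical nap times; the real values may differ. */
#define SKETCH_SHORT_NAP_MS 10L
#define SKETCH_LONG_NAP_MS  10000L

/* Hypothetical view of one slot: standby copy vs. primary state. */
typedef struct SlotPos
{
    XLogRecPtr local_lsn;   /* position stored on the standby */
    XLogRecPtr remote_lsn;  /* position reported by the primary */
} SlotPos;

/*
 * One cycle of the worker. Returns how long to nap (in ms) and reports
 * through *contacted_primary whether the primary was queried at all.
 */
long
sync_cycle(XLogRecPtr latest_wal_end, SlotPos *slots, int nslots,
           bool *contacted_primary)
{
    bool changed = false;

    *contacted_primary = false;

    /* walreceiver never received anything: do nothing this cycle. */
    if (XLogRecPtrIsInvalid(latest_wal_end))
        return SKETCH_LONG_NAP_MS;

    /* Otherwise keep pinging the primary and catch up the slots. */
    *contacted_primary = true;
    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].local_lsn != slots[i].remote_lsn)
        {
            slots[i].local_lsn = slots[i].remote_lsn;
            changed = true;
        }
    }

    /* No slot moved: nothing to do soon, so take the longer nap. */
    return changed ? SKETCH_SHORT_NAP_MS : SKETCH_LONG_NAP_MS;
}
```

In this model, a worker that saw a valid latestWalEnd earlier keeps
contacting the primary, brings any lagging slot up to date once, and
then settles into the longer nap.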

It would be good to know the thoughts of others on the above 3 points.

thanks
Shveta
