Re: Synchronizing slots from primary to standby

From: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-23 15:45:05
Message-ID: 1c4691b6-787c-4b02-adf3-d5865b12820f@gmail.com
Lists: pgsql-hackers

Hi,

On 11/23/23 6:13 AM, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 4:35 PM Drouvot, Bertrand
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>>
>> On 11/21/23 10:32 AM, shveta malik wrote:
>>> On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>>>>
>>
>>> v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd,
>>> rebased the patches. PFA v37_2 patches.
>>
>> Thanks!
>>
>> Regarding the promotion flow: If the primary is available and reachable I don't
>> think we currently try to ensure that slots are in sync. I think we'd miss the
>> activity since the last sync and the promotion request or am I missing something?
>>
>> If the primary is available and reachable shouldn't we launch a last round of
>> synchronization (skipping all the slots that are not in 'r' state)?
>>
>
> We may miss the last round but there is no guarantee that we can
> ensure to sync of everything if the primary is available. Because
> after our last sync, there could probably be some more activity.

I don't think so, thanks to the fact that we ensure that logical walsenders
on the primary wait for the physical standby.

Indeed, that should prevent any decoding activity on the primary while the
promotion is in progress on the standby (at least as soon as the
walreceiver is shut down).

So I think that a promotion flow like:

- walreceiver shutdown
- last round of sync
- sync-worker shutdown

should ensure that slots are in sync (as logical slots on the primary
should not be able to advance once the walreceiver is shut down
during the promotion).
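
To make the ordering argument concrete, here is a toy model in Python. It is
only a sketch under my own assumptions, not PostgreSQL code: the Primary and
Standby classes, their methods and the integer LSNs are all hypothetical. It
assumes the behavior described above, namely that a logical slot on the
primary cannot advance past what the physical standby has confirmed.

```python
# Toy model only: none of these classes or methods exist in PostgreSQL.
from dataclasses import dataclass, field


@dataclass
class Primary:
    # LSN up to which the physical standby has confirmed receipt of WAL.
    standby_flush_lsn: int = 0
    # Confirmed position of each logical slot on the primary.
    slots: dict = field(default_factory=dict)

    def advance_slot(self, name: str, target_lsn: int) -> int:
        # Assumption: logical walsenders wait for the physical standby, so a
        # slot never moves past what the standby has confirmed.
        new_lsn = min(target_lsn, self.standby_flush_lsn)
        self.slots[name] = max(self.slots.get(name, 0), new_lsn)
        return self.slots[name]


@dataclass
class Standby:
    walreceiver_running: bool = True
    synced_slots: dict = field(default_factory=dict)

    def receive_wal(self, primary: Primary, lsn: int) -> None:
        # Once the walreceiver is stopped, no confirmation reaches the primary.
        if self.walreceiver_running:
            primary.standby_flush_lsn = max(primary.standby_flush_lsn, lsn)

    def sync_slots(self, primary: Primary) -> None:
        self.synced_slots = dict(primary.slots)

    def promote(self, primary: Primary, final_sync: bool) -> None:
        self.walreceiver_running = False   # 1. walreceiver shutdown
        if final_sync:
            self.sync_slots(primary)       # 2. last round of sync
        # 3. sync-worker shutdown (nothing to model here)


def run(final_sync: bool):
    primary, standby = Primary(), Standby()
    standby.receive_wal(primary, 100)
    primary.advance_slot("s1", 80)
    standby.sync_slots(primary)            # periodic sync copies position 80
    primary.advance_slot("s1", 100)        # activity after the last sync
    standby.promote(primary, final_sync)
    standby.receive_wal(primary, 200)      # ignored: walreceiver is down
    primary.advance_slot("s1", 200)        # capped at 100 by the standby
    return primary.slots["s1"], standby.synced_slots["s1"]
```

In this model run(True) leaves the slot at the same position on both sides,
while run(False) leaves the standby's copy behind by the activity that
happened after the last periodic sync.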

> I think it is the user's responsibility to promote a new primary when
> the old one is not required for some reason.

Do you mean they should ensure something like the following?

1. no more activity on the primary
2. check that the slots are in sync with the primary
3. promote

but then they could also (without the new feature we're building):

1. create and advance slots manually (pg_replication_slot_advance) on the standby
to sync them up at regular intervals

and then before promotion:

2. ensure no more activity on the primary
3. do a last round of manual slot advancement
4. promote

I think that ensuring the slots are in sync during promotion (should the primary
be available) would provide added value as compared to the above scenarios.

> It is not only slots that
> can be out of sync but even we can miss fetching some of the data. I
> think this is quite similar to what we do for WAL where on finding the
> promotion signal, we shut down Walreceiver and just replay any WAL
> that was already received by walreceiver.

> Also, the promotion
> shouldn't create any problem w.r.t subscribers connecting to the new
> primary because the slot's position is slightly behind what could be
> requested by subscribers which means the corresponding data will be
> available on the new primary.
>

Right.

> Do you have something in mind that can create any problem if we don't
> attempt additional fetching round after the promotion signal is
> received?

It's not a "real" problem per se, but in the case of a non-synced slot, I can see 2 cases:

- publisher/subscriber case: I don't see any problem here, since after
an "alter subscription XXX connection '<new_primary>'" logical replication should
start from the right place thanks to the replication origin associated with the
subscription.

- non publisher/subscriber case (say pg_recvlogical, which does not make use of
replication origins):

a) data between the last sync and the promotion would be decoded again
unless b) or c)
b) user manually advances the slot on the standby after promotion
c) user restarts the decoding with an appropriate --startpos option

It's for this non publisher/subscriber case that I think it would be
beneficial to try to ensure that the slots are in sync during the promotion.
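
As a rough illustration of cases a), b) and c), here is a small Python sketch.
It is a toy model under my own assumptions, not PostgreSQL code: LSNs are
plain integers, the function names are hypothetical, and I assume that
decoding starts from whichever is further ahead, the slot position or the
requested start position.

```python
# Toy model only: integer LSNs and hypothetical helper names.

def sent_after_promotion(change_lsns, synced_slot_lsn, startpos=None):
    """Changes a consumer would receive from the promoted standby.

    synced_slot_lsn: slot position copied to the standby at the last sync.
    startpos: position the consumer asks for (in the spirit of --startpos).
    """
    if startpos is None:
        start = synced_slot_lsn
    else:
        # Assumption: a start position behind the slot is moved forward to it.
        start = max(synced_slot_lsn, startpos)
    return [lsn for lsn in change_lsns if lsn > start]


def duplicates(change_lsns, synced_slot_lsn, consumed_upto, startpos=None):
    # Changes already consumed from the old primary that are sent once more.
    sent = sent_after_promotion(change_lsns, synced_slot_lsn, startpos)
    return [lsn for lsn in sent if lsn <= consumed_upto]


changes = [110, 120, 130, 140]

# a) slot last synced at 120, consumer had already processed up to 140
case_a = duplicates(changes, synced_slot_lsn=120, consumed_upto=140)

# b) user manually advances the slot on the standby to 140 after promotion
case_b = duplicates(changes, synced_slot_lsn=140, consumed_upto=140)

# c) user restarts decoding with a start position of 140
case_c = duplicates(changes, synced_slot_lsn=120, consumed_upto=140, startpos=140)
```

Here case_a contains the changes at 130 and 140 a second time, while case_b
and case_c are empty. A last sync round during promotion would give the
result of b) without any manual step.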

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
