Re: Synchronizing slots from primary to standby

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-06 10:26:45
Message-ID: CAFiTN-tmYtuaoge84YUZdzokCn+tJPxkiS4fsTbEtHQCcP2uDg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > ---
> > > > > Since two processes (e.g. the slotsync worker and
> > > > > pg_sync_replication_slots()) concurrently fetch and update the slot
> > > > > information, there is a race condition where slot's
> > > > > confirmed_flush_lsn goes backward.
> > > > >
> > > >
> > > > Right, this is possible, though there shouldn't be a problem because
> > > > anyway, slotsync is an async process. Till we hold restart_lsn, the
> > > > required WAL won't be removed. Having said that, I can think of two
> > > > ways to avoid it: (a) We can have some flag in shared memory using
> > > > which we can detect whether any other process is doing slot
> > > > synchronization and then either error out at that time or simply wait
> > > > or may take nowait kind of parameter from user to decide what to do?
> > > > If this is feasible, we can simply error out for the first version and
> > > > extend it later if we see any use cases for the same (b) similar to
> > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an
> > > > error, this is good for now but in future we may still have another
> > > > similar issue, so I would prefer (a) among these but I am fine if you
> > > > prefer (b) or have some other ideas like just note down in comments
> > > > that this is a harmless case and can happen only very rarely.
> > >
> > > Thank you for sharing the ideas. I would prefer (a). For (b), the same
> > > issue still happens for other fields.
> >
> > I agree that (a) looks better. On a separate note, while looking at
> > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there
> > be an optional parameter to give one slot or multiple slots or all
> > slots as default, that will give better control to the user no?
> >
>
> As of now, we want to give functionality similar to slotsync worker
> with a difference that users can use this new function for planned
> switchovers. So, syncing all failover slots by default. I think if
> there is a use case to selectively sync some of the failover slots
> then we can probably extend this function and slotsync worker as well.
> Normally, if the primary goes down due to whatever reason users would
> want to restart the replication for all the defined publications via
> existing failover slots. Why would anyone want to do it partially?

If we consider the usability of such a function (I mean as it is
implemented now, without any arguments), one use case could be that
the slot sync worker is not keeping up, or that at some point in time
the user doesn't want to wait for the worker and would rather run the
sync manually.

So if we have such functionality, it would be even better to extend
it to sync slots selectively. For example, suppose there is some issue
in syncing all slots (maybe a bug, or the sync takes a long time
because there are a lot of slots), but the user needs to fail over
quickly and is interested in only a couple of slots. Then such an
option could be helpful, no?
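
To make the suggestion concrete, here is a rough Python model of how
an optional slot-name argument might behave. It is not PostgreSQL code
and not taken from any patch: RemoteSlot, sync_slots, and slot_names
are hypothetical names, and the real function is a C function taking
PG_FUNCTION_ARGS. With no argument it syncs every failover slot, as
the function does today; with a list it syncs only the named slots.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class RemoteSlot:
    """Hypothetical view of a slot as fetched from the primary."""
    name: str
    failover: bool
    confirmed_flush_lsn: int


def sync_slots(remote: Iterable[RemoteSlot],
               local: Dict[str, int],
               slot_names: Optional[List[str]] = None) -> List[str]:
    """Copy confirmed_flush_lsn of failover slots into `local`.

    slot_names=None models the current behaviour (sync all failover
    slots); a list models the proposed selective sync. Returns the
    names of the slots that were synced.
    """
    # Only slots marked for failover are eligible in either mode.
    candidates = {s.name: s for s in remote if s.failover}

    if slot_names is None:
        wanted = list(candidates)
    else:
        # Assumption: asking for an unknown or non-failover slot is an error.
        missing = [n for n in slot_names if n not in candidates]
        if missing:
            raise ValueError(f"not a failover slot on the primary: {missing}")
        wanted = list(slot_names)

    for name in wanted:
        local[name] = candidates[name].confirmed_flush_lsn
    return wanted
```

Calling sync_slots(remote, local) keeps the default of syncing
everything, so existing callers would be unaffected, while
sync_slots(remote, local, ["slot_b"]) touches only that slot.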

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
