Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-06 11:37:44
Message-ID: CAA4eK1KYAjkRRSV-NAfDq=4GyHpd2igs_8se0uW-LNEF2RbaRA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 6, 2024 at 3:57 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > ---
> > > > > > > Since two processes (e.g. the slotsync worker and
> > > > > > pg_sync_replication_slots()) concurrently fetch and update the slot
> > > > > > information, there is a race condition where slot's
> > > > > > confirmed_flush_lsn goes backward.
> > > > > >
> > > > >
> > > > > Right, this is possible, though there shouldn't be a problem because
> > > > > anyway, slotsync is an async process. Till we hold restart_lsn, the
> > > > > required WAL won't be removed. Having said that, I can think of two
> > > > > ways to avoid it: (a) We can have some flag in shared memory using
> > > > > which we can detect whether any other process is doing slot
> > > > > > synchronization and then either error out at that time or simply wait
> > > > > or may take nowait kind of parameter from user to decide what to do?
> > > > > If this is feasible, we can simply error out for the first version and
> > > > > extend it later if we see any use cases for the same (b) similar to
> > > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an
> > > > > error, this is good for now but in future we may still have another
> > > > > similar issue, so I would prefer (a) among these but I am fine if you
> > > > > prefer (b) or have some other ideas like just note down in comments
> > > > > that this is a harmless case and can happen only very rarely.
> > > >
> > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same
> > > > issue still happens for other fields.
> > >
> > > I agree that (a) looks better. On a separate note, while looking at
> > > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there
> > > be an optional parameter to give one slot or multiple slots or all
> > > slots as default, that will give better control to the user no?
> > >
> >
> > As of now, we want to give functionality similar to slotsync worker
> > with a difference that users can use this new function for planned
> > switchovers. So, syncing all failover slots by default. I think if
> > there is a use case to selectively sync some of the failover slots
> > then we can probably extend this function and slotsync worker as well.
> > Normally, if the primary goes down due to whatever reason users would
> > want to restart the replication for all the defined publications via
> > existing failover slots. Why would anyone want to do it partially?
>
> If we consider the usability of such a function (I mean as it is
> implemented now, without any argument) one use case could be that if
> the slot sync worker is not keeping up or at some point in time the
> user doesn't want to wait for the worker to do this instead user can
> do it by himself.
>

Possibly, but I was imagining that it would be used for planned
switchover cases and also for testing the core sync slot functionality
in our TAP tests.

> So now if we have such a functionality then it would be even better to
> extend it to selectively sync the slot. For example, if there is some
> issue in syncing all slots, maybe some bug or taking a long time to
> sync because there are a lot of slots but if the user needs to quickly
> failover and he/she is interested in only a couple of slots then such
> an option could be helpful, no?
>

I see your point, but I am not sure how useful it would be in the
field. I am fine with it if others also think such a parameter will be
useful, and in any case I think we can extend the function after v1 is
done.
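
For reference, option (a) from upthread (a shared flag that lets a
second synchronizer detect a sync already in progress and error out)
could be sketched roughly as below. This is a standalone illustration
under stated assumptions, not code from the patch: SlotSyncCtx,
try_begin_slot_sync() and end_slot_sync() are hypothetical names, and a
C11 atomic stands in for a flag kept in shared memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a small struct kept in shared memory. */
typedef struct SlotSyncCtx
{
    atomic_bool syncing;    /* true while some process is syncing slots */
} SlotSyncCtx;

static SlotSyncCtx slot_sync_ctx = { false };

/*
 * Try to become the only synchronizer.  Returns true if the caller set
 * the flag, false if another process already holds it.  For a first
 * version the caller would simply raise an error on false.
 */
static bool
try_begin_slot_sync(SlotSyncCtx *ctx)
{
    bool expected = false;

    /* Atomically flip false -> true; fails if the flag is already set. */
    return atomic_compare_exchange_strong(&ctx->syncing, &expected, true);
}

/* Release the flag once the sync cycle is finished. */
static void
end_slot_sync(SlotSyncCtx *ctx)
{
    /* Sanity check: only a process that holds the flag should clear it. */
    assert(atomic_load(&ctx->syncing));
    atomic_store(&ctx->syncing, false);
}
```

With something like this, both the slotsync worker and
pg_sync_replication_slots() would call try_begin_slot_sync() before
fetching remote slot information and end_slot_sync() afterwards, so the
two could not interleave updates to the same slot. A wait or nowait
variant could be layered on top later if a use case shows up.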

--
With Regards,
Amit Kapila.
