From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 |
Date: | 2024-11-15 03:26:17 |
Message-ID: | CAA4eK1KN8mrNaKMyiDyGDpcsHEw6=7wArr4KsZRm+AO8mWr-tw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 11/13/24 11:59, Amit Kapila wrote:
> > On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
> > <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> >>
> >> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >>>
> >>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>>>
> >>>>
> >>>> But neither of those fixes prevents backwards move for confirmed_flush
> >>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this
> >>>> assert is incorrect or now. It seems natural that once we get a
> >>>> confirmation for some LSN, we can't move before that position, but I'm
> >>>> not sure about that. Maybe it's too strict.
> >>>
> >>> Hmm, I'm concerned that it might be another problem. I think there are
> >>> some cases where a subscriber sends a flush position older than slot's
> >>> confirmed_flush as a feedback message.
> >>>
> >
> > Right, it can happen for cases where subscribers doesn't have to do
> > anything (for example DDLs) like I have explained in one of my emails
> > [1]
> >
>
> Thanks. I admit not being entirely familiar with all the details, but
> doesn't that email explain more "Why it currently happens?" rather than
> "Is this what should be happening?"
>
Right.
> Sure, the subscriber needs to confirm changes for which nothing needs to
> be done, like DDL. But isn't there a better way to do that, rather than
> allowing confirmed_lsn to go backwards?
>
Firstly, I agree that we should try to find ways to avoid going
confirmed_lsn going backward. We can try to explore solutions both in
the publisher and subscriber-side.
> >>> But it seems to be dangerous if
> >>> we always accept it as a new confirmed_flush value. It could happen
> >>> that confirm_flush could be set to a LSN older than restart_lsn.
> >>>
> >
> > Possible, though I haven't tried to reproduce such a case. But, will
> > it create any issues? I don't know if there is any benefit in allowing
> > to move confirmed_flush LSN backward. AFAIR, we don't allow such
> > backward values to persist. They will temporarily be in memory. I
> > think as a separate patch we should prevent it from moving backward.
> >
> >>
> >> If confirmed_flush LSN moves backwards, it means the transactions
> >> which were thought to be replicated earlier are no longer considered
> >> to be replicated. This means that the restart_lsn of the slot needs to
> >> be at least far back as the oldest of starting points of those
> >> transactions. Thus restart_lsn of slot has to be pushed further back.
> >>
> >
> > I don't see a reason to move restart_lsn backward. Why do you think so?
> >
>
> I think what Ashutosh is saying that if confirmed_flush is allowed to
> move backwards, that may result in start_lsn moving backwards too. And
> we need to be able to decode all transactions committed since start_lsn,
> so if start_lsn moves backwards, maybe restart_lsn needs to move
> backwards too. I have no idea if confirmed_flush/start_lsn can move
> backwards enough to require restart_lsn to move, though.
>
> Anyway, these discussions are a good illustration why I think allowing
> these LSNs to move backwards is a problem. It either causes bugs (like
> with breaking replication slots) and/or it makes the reasoning about
> correct behavior much harder.
>
Right, I think one needs to come up with some reproducible scenarios
where these cause any kind of problem or inefficiency. Then, we can
discuss the solutions accordingly. I mean to say that someone has to
put effort into making a bit more solid case for changing this code
because it may not be a good idea to change something just based on
some theory unless it is just adding some assertions.
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Smith | 2024-11-15 03:36:08 | Re: Improve the error message for logical replication of regular column to generated column. |
Previous Message | Robert Haas | 2024-11-15 03:15:39 | Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints |