Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-15 18:21:05
Message-ID: 77ff7cb2-f820-42a3-82b5-8c585ee8afbf@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/15/24 04:26, Amit Kapila wrote:
> On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> On 11/13/24 11:59, Amit Kapila wrote:
>>> On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
>>> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>>>>
>>>> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>>
>>>>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>>>>
>>>>>>
>>>>>> But neither of those fixes prevents backwards move for confirmed_flush
>>>>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this
>>>>>> assert is incorrect or now. It seems natural that once we get a
>>>>>> confirmation for some LSN, we can't move before that position, but I'm
>>>>>> not sure about that. Maybe it's too strict.
>>>>>
>>>>> Hmm, I'm concerned that it might be another problem. I think there are
>>>>> some cases where a subscriber sends a flush position older than slot's
>>>>> confirmed_flush as a feedback message.
>>>>>
>>>
>>> Right, it can happen for cases where subscribers doesn't have to do
>>> anything (for example DDLs) like I have explained in one of my emails
>>> [1]
>>>
>>
>> Thanks. I admit not being entirely familiar with all the details, but
>> doesn't that email explain more "Why it currently happens?" rather than
>> "Is this what should be happening?"
>>
>
> Right.
>
>> Sure, the subscriber needs to confirm changes for which nothing needs to
>> be done, like DDL. But isn't there a better way to do that, rather than
>> allowing confirmed_lsn to go backwards?
>>
>
> Firstly, I agree that we should try to find ways to avoid going
> confirmed_lsn going backward. We can try to explore solutions both in
> the publisher and subscriber-side.
>

Good that we agree.

I'm not sure how could we do this on the subscriber side. Or more
precisely, I see "LSNs don't go backwards" as an invariant that the
publisher can enforce for all subscribers (with whatever plugin).

And if a subscriber responds in a way that contradicts the invariant,
I'd treat that as a subscriber bug. I don't think we should rely on
subscriber to do the right thing - we have little control over that, and
when things break (say, WAL gets removed too early), people generally
point at the publisher.

>>
>> I think what Ashutosh is saying that if confirmed_flush is allowed to
>> move backwards, that may result in start_lsn moving backwards too. And
>> we need to be able to decode all transactions committed since start_lsn,
>> so if start_lsn moves backwards, maybe restart_lsn needs to move
>> backwards too. I have no idea if confirmed_flush/start_lsn can move
>> backwards enough to require restart_lsn to move, though.
>>
>> Anyway, these discussions are a good illustration why I think allowing
>> these LSNs to move backwards is a problem. It either causes bugs (like
>> with breaking replication slots) and/or it makes the reasoning about
>> correct behavior much harder.
>>
>
> Right, I think one needs to come up with some reproducible scenarios
> where these cause any kind of problem or inefficiency. Then, we can
> discuss the solutions accordingly. I mean to say that someone has to
> put effort into making a bit more solid case for changing this code
> because it may not be a good idea to change something just based on
> some theory unless it is just adding some assertions.
>

I don't know, but isn't this a bit backwards? I understand we don't want
to just recklessly change long-standing code, but I think it's sensible
to enforce sensible invariants like "LSNs can't go backwards" to ensure
correct behavior. And then if there's a performance/efficiency problem,
and someone proposes a fix, it's up to that patch to prove it's correct.

Here we seem to have code no one is quite sure is correct, but we're
asking for reproducible scenarios proving the existence of a bug. The
bugs can be quite subtle / hard to reproduce, as evidenced by the
restart_lsn issue present since 9.4. Maybe this should be the other way
around, i.e. make it "strict" and then prove that (a) relaxing the check
is still correct and (b) it actually has other benefits.

That being said, I don't have a clear idea how to change this.

So +1 to at least introducing the asserts.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2024-11-15 18:24:35 Re: Potential ABI breakage in upcoming minor releases
Previous Message Pavan Deolasee 2024-11-15 18:09:24 Re: Potential ABI breakage in upcoming minor releases