Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-13 11:53:01
Message-ID: 219c9a65-ccac-4ae1-9b83-8b1a06a75184@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/13/24 11:59, Amit Kapila wrote:
> On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>>
>> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>
>>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>>
>>>>
>>>> But neither of those fixes prevents backwards move for confirmed_flush
>>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this
>>>> assert is incorrect or now. It seems natural that once we get a
>>>> confirmation for some LSN, we can't move before that position, but I'm
>>>> not sure about that. Maybe it's too strict.
>>>
>>> Hmm, I'm concerned that it might be another problem. I think there are
>>> some cases where a subscriber sends a flush position older than slot's
>>> confirmed_flush as a feedback message.
>>>
>
> Right, it can happen for cases where subscribers doesn't have to do
> anything (for example DDLs) like I have explained in one of my emails
> [1]
>

Thanks. I admit not being entirely familiar with all the details, but
doesn't that email explain more "Why it currently happens?" rather than
"Is this what should be happening?"

Sure, the subscriber needs to confirm changes for which nothing needs to
be done, like DDL. But isn't there a better way to do that, rather than
allowing confirmed_lsn to go backwards?

>>> But it seems to be dangerous if
>>> we always accept it as a new confirmed_flush value. It could happen
>>> that confirm_flush could be set to a LSN older than restart_lsn.
>>>
>
> Possible, though I haven't tried to reproduce such a case. But, will
> it create any issues? I don't know if there is any benefit in allowing
> to move confirmed_flush LSN backward. AFAIR, we don't allow such
> backward values to persist. They will temporarily be in memory. I
> think as a separate patch we should prevent it from moving backward.
>
>>
>> If confirmed_flush LSN moves backwards, it means the transactions
>> which were thought to be replicated earlier are no longer considered
>> to be replicated. This means that the restart_lsn of the slot needs to
>> be at least far back as the oldest of starting points of those
>> transactions. Thus restart_lsn of slot has to be pushed further back.
>>
>
> I don't see a reason to move restart_lsn backward. Why do you think so?
>

I think what Ashutosh is saying that if confirmed_flush is allowed to
move backwards, that may result in start_lsn moving backwards too. And
we need to be able to decode all transactions committed since start_lsn,
so if start_lsn moves backwards, maybe restart_lsn needs to move
backwards too. I have no idea if confirmed_flush/start_lsn can move
backwards enough to require restart_lsn to move, though.

Anyway, these discussions are a good illustration why I think allowing
these LSNs to move backwards is a problem. It either causes bugs (like
with breaking replication slots) and/or it makes the reasoning about
correct behavior much harder.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-11-13 11:53:34 Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Previous Message Hayato Kuroda (Fujitsu) 2024-11-13 11:10:43 RE: Parallel heap vacuum