Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-12 07:13:17
Message-ID: CAExHW5tdrcQaiiVUCHzs7i6SepuU6as_UaE7fU-_h5xiR=jOQw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >
> >
> > But neither of those fixes prevents backwards move for confirmed_flush
> > LSN, as enforced by asserts in the 0005 patch. I don't know if this
> > assert is incorrect or not. It seems natural that once we get a
> > confirmation for some LSN, we can't move before that position, but I'm
> > not sure about that. Maybe it's too strict.
>
> Hmm, I'm concerned that it might be another problem. I think there are
> some cases where a subscriber sends a flush position older than slot's
> confirmed_flush as a feedback message. But it seems to be dangerous if
> we always accept it as a new confirmed_flush value. It could happen
> that confirmed_flush is set to an LSN older than restart_lsn.
>

If the confirmed_flush LSN moves backwards, transactions that were
previously considered replicated are no longer considered replicated.
The slot's restart_lsn then needs to be at least as far back as the
oldest starting point of those transactions, so it has to be pushed
further back, and that WAL may no longer be available. There is a
similar issue with catalog_xmin: the older catalog rows may already
have been removed. Another problem is that we may send some
transactions twice, which might cause trouble downstream. So I agree
that the confirmed_flush LSN should not move backwards. IIRC, if the
downstream sends an older confirmed_flush in the START_REPLICATION
message, the WAL sender ignores it and uses the one in the
replication slot instead.
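
To make the rule concrete, here is a minimal sketch in C of "never move
confirmed_flush backwards" plus the START_REPLICATION behaviour as I
recall it. The names (Lsn, SketchSlot, sketch_confirm_flush,
sketch_start_point) are hypothetical and this is not the actual slot or
walsender code; it only illustrates the intended invariant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; not PostgreSQL's own type. */
typedef uint64_t Lsn;

/* Hypothetical, heavily simplified slot state, for illustration only. */
typedef struct SketchSlot
{
    Lsn restart_lsn;        /* oldest WAL position the slot still needs */
    Lsn confirmed_flush;    /* position the downstream confirmed as flushed */
} SketchSlot;

/*
 * Apply a flush position reported by the downstream. The position is
 * accepted only if it moves confirmed_flush forward; older or equal
 * values are ignored. Returns true if the slot was updated.
 */
static bool
sketch_confirm_flush(SketchSlot *slot, Lsn reported)
{
    assert(slot != NULL);

    if (reported <= slot->confirmed_flush)
        return false;       /* stale or duplicate feedback: ignore it */

    slot->confirmed_flush = reported;
    return true;
}

/*
 * Pick the position to start streaming from. A requested start point
 * older than confirmed_flush is replaced by confirmed_flush, so already
 * confirmed transactions are not sent again.
 */
static Lsn
sketch_start_point(const SketchSlot *slot, Lsn requested)
{
    assert(slot != NULL);

    return requested < slot->confirmed_flush ? slot->confirmed_flush
                                             : requested;
}
```

With confirmed_flush only ever advancing, restart_lsn and catalog_xmin
never have to be pushed back to cover transactions that were already
confirmed.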

--
Best Wishes,
Ashutosh Bapat
