Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-14 06:15:56
Message-ID: CAExHW5uK1QPhJ_C090xyJVK03xwbhYPLiZT1xSarUF908z2HWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 13, 2024 at 5:25 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 11/13/24 05:38, Ashutosh Bapat wrote:
> > ...
> >
> > Here's way we can fix SnapBuildProcessRunningXacts() similar to
> > DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
> > whether a given transaction should be decoded or not.
> > /*
> > * Should the contents of transaction ending at 'ptr' be decoded?
> > */
> > bool
> > SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
> > {
> > return ptr < builder->start_decoding_at;
> > }
> >
> > Similar to SnapBuild::start_decoding_at we could maintain a field
> > SnapBuild::start_reading_at to the LSN from which the WAL sender would
> > start reading WAL. If candidate_restart_lsn produced by a running
> > transactions WAL record is less than SnapBuild::start_reading_at,
> > SnapBuildProcessRunningXacts() won't call
> > LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We
> > won't access the slot here and the solution will be inline with
> > DecodeCommit() which skips the transactions.
> >
>
> Could you maybe write a patch doing this? That would allow proper
> testing etc.

Here's a quick and dirty patch which describes the idea. I didn't get
time to implement code to move SnapBuild::restart_lsn if
SnapBuild::start_decoding_at moves forward while building initial
snapshot. I am not sure whether that's necessary either.

I have added three elogs to see if the logic is working as expected. I
see two of the elogs in patch in the server log when I run tests from
tests/subscription and tests/recovery. But I do not see the third one.
That either means that the situation causing the bug is not covered by
those tests or the fix is not triggered. If you run your reproduction
and still see the crashes please provide the output of those elog
messages along with the rest of the elogs you have added.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shubham Khanna 2024-11-14 06:18:39 Improve the error message for logical replication of regular column to generated column.
Previous Message Suraj Kharage 2024-11-14 05:01:56 Support for NO INHERIT to INHERIT state change with named NOT NULL constraints