From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-13 04:38:01
Message-ID: CAExHW5tTDyb1Hkd4jj0KukWTY0NZ69h5oqN8zMR6fBYrQfM4EA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
>
> On 11/12/24 10:37, Ashutosh Bapat wrote:
> > On Tue, Nov 12, 2024 at 4:54 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >>
> >>
> >> On 11/11/24 23:41, Masahiko Sawada wrote:
> >>> On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>>
> >>> Which made me re-investigate the issue and thought that it doesn't
> >>> necessarily need to clear these candidate values in memory on
> >>> releasing a slot as long as we're carefully updating restart_lsn.
> >>
> >> Not sure, but maybe it'd be useful to ask the opposite question. Why
> >> shouldn't it be correct to reset the fields, which essentially puts the
> >> slot into the same state as if it was just read from disk? That also
> >> discards all these values, and we can't rely on accidentally keeping
> >> something important info in memory (because if the instance restarts
> >> we'd lose that).
> >>
> >> But this reminds me that the patch I shared earlier today resets the
> >> slot in the ReplicationSlotAcquire() function, but I guess that's not
> >> quite correct. It probably should be in the "release" path.
> >>
> >>> Which seems a bit more efficient, for example when restarting from a
> >>> very old point. Of course, even if we reset them on releasing a slot,
> >>> it would work perfectly, since it's the same as restarting logical
> >>> decoding after a server restart.
> >>
> >> I find the "efficiency" argument a bit odd. It'd be fine if we had a
> >> correct behavior to start with, but we don't have that ... Also, I'm not
> >> quite sure why exactly would it be more efficient?
> >>
> >> And how likely is this in practice? It seems to me that
> >> performance-sensitive cases can't do reconnects very often anyway,
> >> that's inherently inefficient. No?
> >>
> >>> I think
> >>> LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
> >>> not to be working expectedly, but I could not have proof that we
> >>> should either keep or reset them on releasing a slot.
> >>>
> >>
> >> Not sure. Chances are we need both fixes, if only to make
> >> LogicalIncreaseRestartDecodingForSlot more like the other function.
> >>
> >
> > Thanks a lot for pointing to Masahiko's analysis. I missed that part
> > when I read that thread. Sorry.
> >
>
> No need to apologize. The discussion is complex and spread over a rather
> long time period.
>
> > A candidate_restart_lsn and candidate_restart_valid pair just tells
> > that we may set slot's data.restart_lsn to candidate_restart_lsn when
> > the downstream confirms an LSN >= candidate_restart_valid. That pair
> > can never get inaccurate. It may get stale but never inaccurate. So
> > wiping those fields from ReplicationSlot is unnecessary.
> >
>
> Isn't this issue a proof that those fields *can* get inaccurate? Or what
> do you mean by "stale but not inaccurate"?
>
> > What should ideally happen is we should ignore candidates produced by
> > older running transactions WAL records after WAL sender restart. This
> > is in line with what logical replication does with transactions
> > committed before slot's confirmed_flush_lsn - those are ignored. But
> > the criteria for ignoring running transactions records is slightly
> > different from that for transactions. If we ignore
> > candidate_restart_lsn which has candidate_restart_valid <=
> > confirmed_flush_lsn, we might lose some opportunity to advance
> > data.restart_lsn. Instead we should ignore any candidate_restart_lsn
> > <= data.restart_lsn especially before WAL sender finds first change to
> > send downstream. We can do that in SnapBuildProcessRunningXacts() by
> > accessing MyReplicationSlot, taking lock on it and then comparing
> > data.restart_lsn with txn->restart_decoding_lsn before calling
> > LogicalIncreaseRestartDecodingForSlot(). But then
> > LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
> > after applying your patch 0004. The only downside of 0004 is that the
> > logic to ignore candidates produced by a running transactions record
> > is not clearly visible in SnapBuildProcessRunningXacts(). For a
> > transaction which is ignored the logic to ignore the transaction is
> > visible in DecodeCommit() or DecodeAbort() - where people are likely
> > to look for that logic. We may add a comment to that effect in
> > SnapBuildProcessRunningXacts().
> >
>
> I have thought about just doing something like:
>
> slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);
>
> and similar for the other LSN fields. And it did resolve the issue at
> hand, of course. But it seems sloppy, and I'm worried it might easily
> mask actual issues in other cases.
>
> I'm still of the opinion that (with the exception of a reconnect), these
> places should not need to deal with values that go backwards. It should
> work just fine without the Max(), and we should add Asserts() to check
> that it's always a higher LSN.
>
> For the reconnect, I think it's a bit as if the primary restarted right
> before the reconnect. That could happen anyway, and we need to handle
> that correctly - if not, we have yet another issue, IMHO. And with the
> restart it's the same as writing the slot to disk and reading it back,
> which also doesn't retain most of the fields. So it seems cleaner to do
> the same thing and just reset the various fields.
>
>
> I haven't thought about making SnapBuildProcessRunningXacts() more
> complex to consider this stuff. But I doubt we'd like to be accessing
> slots from that code - it has nothing to do with slots.
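For illustration, the clamping approach quoted above could be sketched as
follows. This is a simplified, hypothetical standalone version: `SlotData`
and `clamp_restart_lsn` are made-up names standing in for
ReplicationSlotPersistentData and the real update sites, which also hold
the slot's spinlock.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* Hypothetical, stripped-down stand-in for the slot's persistent data. */
typedef struct
{
    XLogRecPtr restart_lsn;
} SlotData;

/* The clamp discussed above: never let restart_lsn move backwards. */
static void
clamp_restart_lsn(SlotData *slot, XLogRecPtr new_lsn)
{
    if (new_lsn > slot->restart_lsn)
        slot->restart_lsn = new_lsn;
}
```

As noted above, the downside is that such a clamp silently hides backward
movement instead of explaining it, which is why an Assert-based approach
may be preferable.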

Here's a way we could fix SnapBuildProcessRunningXacts() similar to
DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide
whether a given transaction should be decoded or not:
/*
 * Should the contents of transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr < builder->start_decoding_at;
}

Similar to SnapBuild::start_decoding_at, we could maintain a field
SnapBuild::start_reading_at set to the LSN from which the WAL sender
starts reading WAL. If a candidate_restart_lsn produced by a running
transactions WAL record is less than SnapBuild::start_reading_at,
SnapBuildProcessRunningXacts() won't call
LogicalIncreaseRestartDecodingForSlot() with that candidate LSN. We
won't access the slot there, and the solution will be in line with
DecodeCommit(), which skips such transactions.
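A rough standalone sketch of that proposed check (hypothetical: the real
SnapBuild struct has many more fields, and SnapBuildRestartCandidateNeedsSkip
is a made-up name for the proposed analogue):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* Hypothetical subset of SnapBuild: start_decoding_at exists today,
 * start_reading_at is the field proposed here. */
typedef struct
{
    XLogRecPtr start_decoding_at;   /* skip xacts ending before this */
    XLogRecPtr start_reading_at;    /* LSN the WAL sender started reading at */
} SnapBuild;

/* Existing check used by DecodeCommit(), quoted above. */
static bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr < builder->start_decoding_at;
}

/* Proposed analogue: ignore candidate restart LSNs produced by
 * running-xacts records that predate the sender's starting point. */
static bool
SnapBuildRestartCandidateNeedsSkip(SnapBuild *builder, XLogRecPtr candidate)
{
    return candidate < builder->start_reading_at;
}
```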

--
Best Wishes,
Ashutosh Bapat
