Quick Links

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date:	2024-11-12 12:59:39
Message-ID:	cc0965ff-4b17-4e46-b5ca-fe009734d253@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 11/12/24 13:19, Amit Kapila wrote:
> On Tue, Nov 12, 2024 at 4:55 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> There's also the question of backpatching - the simpler the better, and
>> this I think just resetting the fields wins in this regard. The main
>> question is whether it's correct - I think it is. I'm not too worried
>> about efficiency very much, on the grounds that this should not matter
>> very often (only after unexpected restart). Correctness > efficiency.
>>
>
> Sure, but what is wrong with changing
> LogicalIncreaseRestartDecodingForSlot's "if
> (slot->candidate_restart_valid == InvalidXLogRecPtr)" to "else if
> (slot->candidate_restart_valid == InvalidXLogRecPtr)"? My previous
> analysis [1][2] on similar issue also leads to that conclusion. Then
> later Sawada-San's email [3] also leads to the same solution. I know
> that the same has been discussed in this thread and we are primarily
> worried about whether we are missing some case that needs the current
> code in LogicalIncreaseRestartDecodingForSlot(). It is always possible
> that all who have analyzed are missing some point but I feel the
> chances are less. I vote to fix LogicalIncreaseRestartDecodingForSlot.
>

I'm not opposed to adjusting LogicalIncreaseRestartDecodingForSlot().
The difference from LogicalIncreaseXminForSlot() seems accidental, in
which case fixing it seems desirable.

> Now, we have at least some theory to not clear candidate_restart_*
> values which is why to prevent advancing restart_lsn earlier if we get
> confirmation from the subscriber. Now, your theory that walsender
> exits should be rare so this doesn't impact much is also true but OTOH
> why change something that can work more efficiently provided we fix
> LogicalIncreaseRestartDecodingForSlot as per our analysis?
>

Because with how it works now it's impossible to check if the LSN values
are correct. We get a LSN value that's before what we already have in
the slot, and it might be either fine after a restart, or a bug (perhaps
on the subscriber side) causing all kinds of strange issues.

I'm all in favor of making stuff more efficient, but only if it doesn't
harm reliability. And that does seem to be happening here, because this
(not resetting + inability to have strict checks on the LSN updates)
seems to be one of the reasons why we have this bug since 9.4.

Sure, maybe fixing LogicalIncreaseRestartDecodingForSlot() is enough to
fix this particular case. But I'd be happier if we could also add
asserts checking the LSN advances, to detect similar issues that we may
be unaware of yet.

regards

--
Tomas Vondra

In response to

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 at 2024-11-12 12:19:47 from Amit Kapila

Responses

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 at 2024-11-13 09:38:27 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2024-11-12 13:25:30	Re: RFC: Additional Directory for Extensions
Previous Message	Kirill Reshke	2024-11-12 12:54:46	CREATE SCHEMA ... CREATE DOMAIN support