Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-09 11:45:06
Message-ID: ddb636b7-2581-4a0a-8fd9-4e90e8b79195@vondra.me
Lists: pgsql-hackers

On 11/8/24 19:25, Masahiko Sawada wrote:
> Hi,
>
> Thank you for investigating this issue.
>
> On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> Hi,
>>
>> I kept investigating this, but I haven't made much progress. I still
>> don't understand why it would be OK to move any of the LSN fields
>> backwards - certainly not for fields like confirmed_flush or restart_lsn.
>>
>> I did a simple experiment - added asserts to the couple of places in
>> logical.c updating the LSN fields, checking that the value never decreases.
>> But then I simply ran make check-world, instead of the stress test.
>>
>> And that actually fails too, 040_standby_failover_slots_sync.pl triggers
>> this
>>
>> {
>>     SpinLockAcquire(&MyReplicationSlot->mutex);
>>     Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
>>     MyReplicationSlot->data.confirmed_flush = lsn;
>>     SpinLockRelease(&MyReplicationSlot->mutex);
>> }
>>
>> So this moves confirmed_flush back, albeit only by a tiny amount (I've
>> seen ~56 byte difference). I don't have an example of this causing an
>> issue in practice, but I note that CheckPointReplicationSlots does this:
>>
>> if (is_shutdown && SlotIsLogical(s))
>> {
>>     SpinLockAcquire(&s->mutex);
>>
>>     if (s->data.invalidated == RS_INVAL_NONE &&
>>         s->data.confirmed_flush > s->last_saved_confirmed_flush)
>>     {
>>         s->just_dirtied = true;
>>         s->dirty = true;
>>     }
>>     SpinLockRelease(&s->mutex);
>> }
>>
>> to determine if a slot needs to be flushed to disk during checkpoint. So
>> I guess it's possible we save a slot to disk at some LSN, then the
>> confirmed_flush moves backward, and we fail to sync the slot to disk.
>>
>> But I don't have a reproducer for this ...
>>
>>
>> I also noticed a strange difference between LogicalIncreaseXminForSlot
>> and LogicalIncreaseRestartDecodingForSlot.
>>
>> The structure of LogicalIncreaseXminForSlot looks like this:
>>
>> if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
>> {
>> }
>> else if (current_lsn <= slot->data.confirmed_flush)
>> {
>>     ... update candidate fields ...
>> }
>> else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
>> {
>>     ... update candidate fields ...
>> }
>>
>> while LogicalIncreaseRestartDecodingForSlot looks like this:
>>
>> if (restart_lsn <= slot->data.restart_lsn)
>> {
>> }
>> else if (current_lsn <= slot->data.confirmed_flush)
>> {
>>     ... update candidate fields ...
>> }
>>
>> if (slot->candidate_restart_valid == InvalidXLogRecPtr)
>> {
>>     ... update candidate fields ...
>> }
>>
>> Notice that LogicalIncreaseXminForSlot has the third block guarded by
>> "else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
>> that a bit suspicious, considering the functions do the same thing, just
>> for different fields? I don't know if this is dangerous; the comments
>> suggest it may just waste extra effort after a reconnect.
>>
>
> I also suspected this point. I still need to investigate whether it is
> related to the issue, but I find this code in
> LogicalIncreaseRestartDecodingForSlot() dangerous.
>
> We update the slot's restart_lsn based on candidate_restart_lsn and
> candidate_restart_valid upon receiving a feedback message from a
> subscriber, then clear both fields. Therefore, this code in
> LogicalIncreaseRestartDecodingForSlot() can end up setting
> candidate_restart_lsn to an arbitrary LSN after the slot's
> restart_lsn has been updated.
>
> I think an LSN older than the slot's restart_lsn can be passed to
> LogicalIncreaseRestartDecodingForSlot() as restart_lsn, for example
> after logical decoding restarts. The scenario I shared on another
> thread was that after updating the slot's restart_lsn (upon feedback
> from a subscriber) based on the candidate_restart_lsn and
> candidate_restart_valid still remaining in the slot, we might call
> LogicalIncreaseRestartDecodingForSlot() while decoding a RUNNING_XACTS
> record whose LSN is older than the slot's new restart_lsn. In that
> case, we end up passing an LSN older than the new restart_lsn to
> LogicalIncreaseRestartDecodingForSlot(), and that LSN gets set as
> candidate_restart_lsn.

Right, I believe that matches my observations. I only see the issues
after (unexpected) restarts, say due to network issues, but chances are
regular reconnects have the same problem.
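
To make that sequence easier to see, here's a simplified standalone mock
of the code path (MockSlot, increase_restart_decoding and the LSN values
are all made up for illustration - this is not the actual logical.c code).
With the slot's restart_lsn already at 1000 and both candidates cleared,
decoding an old RUNNING_XACTS record re-installs an older LSN as
candidate_restart_lsn through the unguarded second "if":

/*
 * Simplified standalone mock of the candidate-update logic, mimicking the
 * structure of LogicalIncreaseRestartDecodingForSlot().
 */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct MockSlot
{
    XLogRecPtr  restart_lsn;            /* data.restart_lsn */
    XLogRecPtr  confirmed_flush;        /* data.confirmed_flush */
    XLogRecPtr  candidate_restart_lsn;
    XLogRecPtr  candidate_restart_valid;
} MockSlot;

static void
increase_restart_decoding(MockSlot *slot, XLogRecPtr current_lsn,
                          XLogRecPtr restart_lsn)
{
    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->restart_lsn)
    {
    }
    else if (current_lsn <= slot->confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    }

    /* plain "if" - runs even though restart_lsn went backwards above */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    }
}

int
main(void)
{
    /* slot state after the feedback message consumed the old candidates */
    MockSlot    slot = {
        .restart_lsn = 1000,
        .confirmed_flush = 900,
        .candidate_restart_lsn = InvalidXLogRecPtr,
        .candidate_restart_valid = InvalidXLogRecPtr,
    };

    /* decoding restarts and replays an old RUNNING_XACTS record */
    increase_restart_decoding(&slot, 500, 800);

    /* prints restart_lsn=1000 candidate_restart_lsn=800 - went backwards */
    printf("restart_lsn=%llu candidate_restart_lsn=%llu\n",
           (unsigned long long) slot.restart_lsn,
           (unsigned long long) slot.candidate_restart_lsn);
    return 0;
}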

> My hypothesis is that we wanted to prevent such cases with the
> first if block:
>
> /* don't overwrite if have a newer restart lsn */
> if (restart_lsn <= slot->data.restart_lsn)
> {
> }
>

Yeah, that condition / comment seems to say exactly that.
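
Just to illustrate what that comment seems to promise (my sketch, not
necessarily the fix from the old thread): the third block would need to be
part of the same chain, mirroring how LogicalIncreaseXminForSlot is
structured:

if (restart_lsn <= slot->data.restart_lsn)
{
    /* don't overwrite if have a newer restart lsn */
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    ... update candidate fields ...
}
else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    ... update candidate fields ...
}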

Do you plan / expect to work on fixing this? It seems you proposed the
right fix in that old thread, but it's been inactive since 2023/02 :-(

regards

--
Tomas Vondra
