Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: depesz(at)depesz(dot)com, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-21 20:08:36
Message-ID: 20221121200836.wov46biwtramawmq@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> I think this problem could arise when walsender exits due to some
> error like "terminating walsender process due to replication timeout".
> Here is the theory I came up with:
>
> 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> allow all files till 000000000001039D00000082 to be removed.
> 2. Next the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
> 3. walsender restarts due to replication timeout.
> 4. After restart, it starts reading WAL from 1039D/83825958 as that
> was restart_lsn.
> 5. walsender gets a message to update write, flush, apply, etc. As
> part of that, it invokes
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> replicationSlotMinLSN will also be computed to the same value, allowing
> removal of all files older than 000000000001039D0000008A. This will
> allow removing 000000000001039D00000083, 000000010001039D00000084,
> etc.
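
For reference, the mapping from the LSNs in the steps above to WAL segment
file names can be sketched as below. This is a minimal sketch, not
PostgreSQL source: it assumes the default 16 MB wal_segment_size, and
make_lsn(), lsn_to_segno() and wal_file_name() are hypothetical helper
names chosen for illustration. The timeline ID is passed in by the caller.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Assumption: default 16 MB WAL segments (wal_segment_size). */
#define WAL_SEGMENT_SIZE (UINT64_C(16) * 1024 * 1024)

/* Build a 64-bit LSN from the two halves printed as "hi/lo". */
static XLogRecPtr
make_lsn(uint32_t hi, uint32_t lo)
{
    return ((XLogRecPtr) hi << 32) | lo;
}

/* Number of the segment that contains the given LSN. */
static uint64_t
lsn_to_segno(XLogRecPtr lsn, uint64_t segsz)
{
    assert(segsz > 0);
    return lsn / segsz;
}

/*
 * Format the 24-hex-digit segment file name: timeline ID, then the
 * segment number split into a high and a low 8-digit part.
 * buf must have room for 25 bytes.
 */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli,
              uint64_t segno, uint64_t segsz)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / segsz;

    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

Under these assumptions, 1039D/83825958 falls in the segment whose name
ends in 00000083 and 1039D/8B5773D8 in the one ending in 0000008B, which
matches the segment numbers used in the theory quoted above.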

This would require that the client acknowledged an LSN that we haven't
sent out, no? Shouldn't the
MyReplicationSlot->candidate_restart_valid <= lsn
check in LogicalConfirmReceivedLocation() have prevented this from
happening unless the client acknowledges up to candidate_restart_valid?
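
The condition in question can be illustrated with a simplified model.
This is a sketch under stated assumptions, not the actual source: the
SlotModel struct and model_confirm_received() are hypothetical stand-ins
that keep only the restart_lsn handling and leave out locking, xmin
handling and persisting the slot to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Simplified stand-in for the slot fields discussed here (not the real struct). */
typedef struct SlotModel
{
    XLogRecPtr restart_lsn;             /* oldest WAL the slot still needs */
    XLogRecPtr confirmed_flush;         /* last position acked by the client */
    XLogRecPtr candidate_restart_lsn;   /* proposed new restart_lsn */
    XLogRecPtr candidate_restart_valid; /* ack position that makes it safe */
} SlotModel;

/*
 * Model of the restart_lsn part of LogicalConfirmReceivedLocation().
 * Returns true if restart_lsn was advanced.
 */
static bool
model_confirm_received(SlotModel *slot, XLogRecPtr lsn)
{
    bool updated_restart = false;

    assert(lsn != InvalidXLogRecPtr);
    slot->confirmed_flush = lsn;

    /* Only apply the candidate once the client has acked far enough. */
    if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
        slot->candidate_restart_valid <= lsn)
    {
        slot->restart_lsn = slot->candidate_restart_lsn;
        slot->candidate_restart_lsn = InvalidXLogRecPtr;
        slot->candidate_restart_valid = InvalidXLogRecPtr;
        updated_restart = true;
    }

    return updated_restart;
}
```

In this model an acknowledgement below candidate_restart_valid leaves
restart_lsn untouched, which is the property the question above relies
on.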

> 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> Remember from step 1, we are still reading WAL from that location.

I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
in that case, because of the
/* don't overwrite if have a newer restart lsn */
check.
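
The effect of that check can be sketched with a simplified model. This
is not the actual source: SlotModel and model_increase_restart() are
hypothetical names, and modelling the guard as an early return is an
assumption of this sketch about how the check is meant to behave. The
real function has further branches and takes the slot's spinlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Simplified stand-in for the slot fields discussed here (not the real struct). */
typedef struct SlotModel
{
    XLogRecPtr restart_lsn;             /* oldest WAL the slot still needs */
    XLogRecPtr candidate_restart_lsn;   /* proposed new restart_lsn */
    XLogRecPtr candidate_restart_valid; /* ack position that makes it safe */
} SlotModel;

/*
 * Model of the guard in LogicalIncreaseRestartDecodingForSlot().
 * Returns true if a new candidate was recorded.
 */
static bool
model_increase_restart(SlotModel *slot, XLogRecPtr current_lsn,
                       XLogRecPtr restart_lsn)
{
    assert(current_lsn != InvalidXLogRecPtr);
    assert(restart_lsn != InvalidXLogRecPtr);

    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->restart_lsn)
        return false;

    /* Assumption: record a candidate only when none is pending. */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
        return true;
    }

    return false;
}
```

With restart_lsn already at 1039D/8B5773D8, a proposal of
1039D/83825958 is rejected in this model and no candidate is recorded.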

> If this diagnosis is correct, I think we need to clear
> candidate_restart_lsn and friends during ReplicationSlotRelease().

Possible, but I don't quite see it yet.

Greetings,

Andres Freund
