From: Tomas Vondra <tomas(at)vondra(dot)me>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2024-11-21 13:59:27
Message-ID: ec90c6ed-92eb-4fa5-861b-af6167f04c43@vondra.me
Lists: pgsql-hackers

On 11/20/24 23:19, Tomas Vondra wrote:
> On 11/20/24 18:24, Tomas Vondra wrote:
>>
>> ...
>>
>> What confuses me a bit is that we update the restart_lsn (and call
>> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
>> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
>> example. So we actually see the required LSN to move during checkpoint
>> very often. So how come we don't see the issues much more often? Surely
>> I miss something important.
>>
>
> This question "How come we don't see this more often?" kept bugging me,
> and the answer is actually pretty simple.
>
> The restart_lsn can move backwards after a hard restart (for the reasons
> explained), but physical replication does not actually rely on that. The
> replica keeps track of the LSN it received (well, it uses the same LSN),
> and on reconnect it sends the startpoint to the primary. And the primary
> just proceeds to use that instead of the (stale) restart LSN for the slot.
> And the startpoint is guaranteed (I think) to be at least restart_lsn.
>
> AFAICS this would work for pg_replication_slot_advance() too - that is, if
> you remember the last LSN the slot advanced to, it should be possible to
> advance to it just fine. Of course, it requires a way to remember that
> LSN, which for a replica is not an issue. But this just highlights we
> can't rely on restart_lsn for this purpose.
>

I kept thinking about this (sorry for the incremental messages),
particularly about whether this applies to logical replication too. And
AFAICS it does not, or at least not to the same extent.

For streaming, the subscriber sends the startpoint (just like in
physical replication), so it should be protected too.
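
For the record, this is roughly what happens on the walsender side with
the client-supplied startpoint (a paraphrase of CreateDecodingContext()
from memory, not a verbatim quote of the source):

    XLogRecPtr  start_lsn = cmd->startpoint;    /* sent by subscriber */

    if (XLogRecPtrIsInvalid(start_lsn))
        start_lsn = slot->data.confirmed_flush; /* continue from last pos */
    else if (start_lsn < slot->data.confirmed_flush)
        start_lsn = slot->data.confirmed_flush; /* never go backwards */

    /* changes are streamed out starting at start_lsn, so a stale on-disk
     * restart_lsn does not matter, as long as the subscriber remembers
     * where it was */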

But then there's the SQL API, pg_logical_slot_get_changes(). It turns
out that ends up syncing the slot to disk pretty often too, because for
RUNNING_XACTS records we call LogicalDecodingProcessRecord() +
standby_decode(), which ends up calling SaveSlotToDisk(). And at the end
we call LogicalConfirmReceivedLocation() for good measure, which saves
the slot too, just to be sure.
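
From memory, the relevant bit at the end of that path looks about like
this (in LogicalConfirmReceivedLocation() in logical.c - a sketch, not a
verbatim quote):

    if (updated_xmin || updated_restart)
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotSave();      /* -> SaveSlotToDisk() */
    }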

FWIW I suspect this still is not perfectly safe, because we may still
crash / restart before the updated data.restart_lsn makes it to disk,
but after it was already used to remove old WAL - although that's
probably harder to hit. With streaming the subscriber will still send us
the new startpoint, so that should not fail, I think. But with the SQL
API we probably can get into the "segment already removed" issue.
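
To spell out the window I mean (a schematic sequence, not actual code):

    /* in-memory update, e.g. while advancing the slot */
    slot->data.restart_lsn = new_lsn;
    ReplicationSlotsComputeRequiredLSN();   /* a checkpoint may now
                                             * recycle WAL before new_lsn */

    /* <- crash here: on disk the slot still has old_lsn, but the
     *    segment containing old_lsn may already be gone */

    SaveSlotToDisk(slot);   /* this would have closed the window */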

I haven't tried reproducing this yet; I guess it should be possible
using injection points. Not sure when I'll get to that, though.

In any case, doesn't this suggest SaveSlotToDisk() is not all that
expensive, if we already do it pretty often for logical replication? Its
cost was presented as the main reason why pg_replication_slot_advance()
doesn't save the slot. Maybe it should?
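
Concretely, I mean something like this in
pg_physical_replication_slot_advance() in slotfuncs.c (just a sketch,
untested):

    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.restart_lsn = moveto;
    SpinLockRelease(&MyReplicationSlot->mutex);

    ReplicationSlotMarkDirty();
    ReplicationSlotSave();      /* new: flush right away, so the advance
                                 * can't be used to justify removing WAL
                                 * the on-disk restart_lsn still needs */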

If the advance is substantial I don't think the cost really matters,
because there simply can't be that many large advances - it amortizes,
in a way. But even with smaller advances it should be fine, I think: if
the goal is to not remove WAL prematurely, it's enough to flush when the
restart_lsn moves to the next segment.
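
That is, something like (again just a sketch, untested):

    XLogSegNo   old_seg,
                new_seg;

    XLByteToSeg(startlsn, old_seg, wal_segment_size);
    XLByteToSeg(moveto, new_seg, wal_segment_size);

    ReplicationSlotMarkDirty();
    if (new_seg != old_seg)
        ReplicationSlotSave();  /* flush only when crossing a segment
                                 * boundary - WAL removal happens at
                                 * segment granularity anyway */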

regards

--
Tomas Vondra
