Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2024-11-20 17:24:40
Message-ID: 683f3c82-d38e-436d-88fd-27722af62005@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/31/24 11:18, Vitaly Davydov wrote:
> Dear Hackers,
>
>  
>
> I'd like to discuss a problem with replication slots's restart LSN.
> Physical slots are saved to disk at the beginning of checkpoint. At the
> end of checkpoint, old WAL segments are recycled or removed from disk,
> if they are not kept by slot's restart_lsn values.
>

I agree that if we can lose WAL still needed for a replication slot,
that is a bug. Retaining the WAL is the primary purpose of slots, and we
just fixed a similar issue for logical replication.

>
> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> miminal LSN is occured at the end of checkpoint, prior to old WAL
> segments removal. If to hard stop (pg_stl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.
>

Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:

... something ...

CheckPointGuts(checkPoint.redo, flags);

... something ...

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?

>
> The doc [0] describes that restart_lsn may be set to the some past value
> after reload. There is a discussion [1] on pghackers where such
> behaviour is discussed. The main reason of not flushing physical slots
> on advancing is a performance reason. I'm ok with such behaviour, except
> of that the corresponding WAL segments should not be removed.
>

I don't know which part of [0] you refer to, but I guess you're
referring to this:

The current position of each slot is persisted only at checkpoint,
so in the case of a crash the slot may return to an earlier LSN,
which will then cause recent changes to be sent again when the
server restarts. Logical decoding clients are responsible for
avoiding ill effects from handling the same message more than once.

Yes, it's fine if we discard the new in-memory restart_lsn value, and we
do this for performance reasons - flushing the slot on every advance
would be very expensive. I haven't read [1] as it's quite long, but I
guess that's what it says.

But we must not make any "permanent" actions based on the unflushed
value, I think. Like, we should not remove WAL segments, for example.

>  
>
> I propose to keep WAL segments by saved on disk (flushed) restart_lsn of
> slots. Add a new field restart_lsn_flushed into ReplicationSlot
> structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It
> doesn't change the format of storing the slot contents on disk. I
> attached a patch. It is not yet complete, but demonstate a way to solve
> the problem.
>

That seems like a possible fix this, yes. And maybe it's the right one.

>
> I reproduced the problem by the following way:
>
> * Add some delay in CheckPointBuffers (pg_usleep) to emulate long
> checkpoint execution.
> * Execute checkpoint and pg_replication_slot_advance right after
> starting of the checkpoint from another connection.
> * Hard restart the server right after checkpoint completion.
> * After restart slot's restart_lsn may point to removed WAL segment.
>
> The proposed patch fixes it.
>

I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what you mean by
"hard reset"?

What confuses me a bit is that we update the restart_lsn (and call
ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
all the time. Walsender does that in PhysicalConfirmReceivedLocation for
example. So we actually see the required LSN to move during checkpoint
very often. So how come we don't see the issues much more often? Surely
I miss something important.

Another option might be that pg_replication_slot_advance() doesn't do
something it should be doing. For example, shouldn't be marking the slot
as dirty?

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Devulapalli, Raghuveer 2024-11-20 17:37:33 RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C
Previous Message Daniel Gustafsson 2024-11-20 17:14:44 Re: Replace current implementations in crypt() and gen_salt() to OpenSSL