Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2025-04-04 03:22:19
Message-ID: CAPpHfdvQPndPMbPvrHqUYbi_HnTZxxaZ4cafvF+9Mx56v07W-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Vitaly!

On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
> The slot data is flushed to the disk at the beginning of checkpoint. If
> an existing slot is advanced in the middle of checkpoint execution, its
> advanced restart LSN is taken to calculate the oldest LSN for WAL
> segments removal at the end of checkpoint. If the node is restarted just
> after the checkpoint, the slots data will be read from the disk at
> recovery with the oldest restart LSN which can refer to removed WAL
> segments.
>
> The patch introduces a new in-memory state for slots -
> flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> segments removal. This state is updated every time with the current
> restart_lsn at the moment, when the slot is saving to disk.

Thank you for your work on this subject. I think generally your
approach is correct. When we're truncating the WAL log, we need to
reply on the position that would be used in the case of server crush.
That is the position flushed to the disk.

While your patch is generality looks good, I'd like make following notes:

1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
to advance the position of WAL needed by replication slots, the usage
pattern probably could be changed. Thus, we probably need to call
ReplicationSlotsComputeRequiredLSN() somewhere after change of
restart_lsn_flushed while restart_lsn is not changed. And probably
can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
restart_lsn is changed.
2) I think it's essential to include into the patch test caches which
fail without patch. You could start from integrating [1] test into
your patch, and then add more similar tests for different situations.

Links.
1. https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me

------
Regards,
Alexander Korotkov
Supabase

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2025-04-04 03:28:27 RE: pg_recvlogical cannot create slots with failover=true
Previous Message jian he 2025-04-04 03:20:07 Re: speedup COPY TO for partitioned table.