From: Tomas Vondra <tomas(at)vondra(dot)me>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date: 2024-11-21 23:05:40
Message-ID: e3ac0535-e7a2-4a96-9b36-9f765e9cfec5@vondra.me
Lists: pgsql-hackers
On 11/21/24 14:59, Tomas Vondra wrote:
>
> ...
>
> But then there's the SQL API - pg_logical_slot_get_changes(). And it
> turns out it ends up syncing the slot to disk pretty often, because for
> RUNNING_XACTS we call LogicalDecodingProcessRecord() + standby_decode(),
> which ends up calling SaveSlotToDisk(). And at the end we call
> LogicalConfirmReceivedLocation() for good measure, which saves the slot
> too, just to be sure.
>
> FWIW I suspect this still is not perfectly safe, because we may still
> crash / restart before the updated data.restart_lsn makes it to disk,
> but after it was already used to remove old WAL, although that's
> probably harder to hit. With streaming the subscriber will still send us
> the new startpoint, so that should not fail I think. But with the SQL
> API we probably can get into the "segment already removed" issues.
>
> I haven't tried reproducing this yet, I guess it should be possible
> using the injection points. Not sure when I get to this, though.
>
I kept pulling on this loose thread, and the deeper I look the more I'm
convinced ReplicationSlotsComputeRequiredLSN() is fundamentally unsafe.
I may be missing something, of course, in which case I'd be grateful if
someone could correct me.
I believe the main problem is that ReplicationSlotsComputeRequiredLSN()
operates on data that may not be on-disk yet. It just iterates over
slots in shared memory, looks at the data.restart_lsn, and rolls with
that. So some of the data may be lost after a crash or "immediate"
restart, which for restart_lsn means it can move backwards by some
unknown amount.
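For reference, the logic is roughly this (an abridged sketch, not the
verbatim source - invalidation checks etc. omitted):

    void
    ReplicationSlotsComputeRequiredLSN(void)
    {
        XLogRecPtr  min_required = InvalidXLogRecPtr;

        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (int i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            XLogRecPtr  restart_lsn;

            if (!s->in_use)
                continue;

            /* in-memory value - may not have been written to disk yet */
            SpinLockAcquire(&s->mutex);
            restart_lsn = s->data.restart_lsn;
            SpinLockRelease(&s->mutex);

            if (!XLogRecPtrIsInvalid(restart_lsn) &&
                (XLogRecPtrIsInvalid(min_required) || restart_lsn < min_required))
                min_required = restart_lsn;
        }
        LWLockRelease(ReplicationSlotControlLock);

        /* this is what the checkpointer later consults when removing WAL */
        XLogSetReplicationSlotMinimumLSN(min_required);
    }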
Unfortunately, some of the callers use the value as if it were durable,
and take irreversible actions based on it. This whole thread is about the
checkpointer using the value to discard WAL supposedly not required by
any slot, only to find out later that we're missing WAL.
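The checkpointer side looks roughly like this (heavily abridged from
CreateCheckPoint() / KeepLogSeg(), exact arguments omitted):

    /* KeepLogSeg(): limit WAL removal by the slot minimum computed above */
    keep = XLogGetReplicationSlotMinimumLSN();
    ...

    /* CreateCheckPoint(): decide the oldest segment to keep, then unlink */
    KeepLogSeg(recptr, &_logSegNo);
    ...
    RemoveOldXlogFiles(_logSegNo, ...);   /* irreversible - segments are gone */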
That seems like a rather fundamental problem, and the only reason why we
don't see this causing trouble more often is that (a) abrupt restarts
are not very common, (b) most slots are likely not lagging very much,
and thus not in danger of actually losing WAL, and (c) the streaming
replication tracks startpoint, which masks the issue.
But with the SQL API it's quite simple to cause issues with the right
timing, as I'll show in a bit.
There's an interesting difference in how different places update the
slot. For example LogicalConfirmReceivedLocation() does this:
1) update slot->data.restart_lsn
2) mark slot dirty: ReplicationSlotMarkDirty()
3) save slot to disk: ReplicationSlotSave()
4) recalculate required LSN: ReplicationSlotsComputeRequiredLSN()
while pg_replication_slot_advance() does only this:
1) update slot->data.restart_lsn
2) mark slot dirty: ReplicationSlotMarkDirty()
3) recalculate required LSN: ReplicationSlotsComputeRequiredLSN()
That is, it doesn't save the slot to disk. It just updates the LSN and
then proceeds to recalculate the "required LSN" for all slots. That
makes it very easy to hit the issue, as demonstrated by Vitaly.
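Condensed, the two code paths look like this (just a sketch, not the
verbatim source):

    /* LogicalConfirmReceivedLocation() - logical slots */
    MyReplicationSlot->data.restart_lsn = restart_lsn;   /* (1) under spinlock */
    ReplicationSlotMarkDirty();                          /* (2) */
    ReplicationSlotSave();                               /* (3) fsync to disk */
    ReplicationSlotsComputeRequiredLSN();                /* (4) */

    /* pg_replication_slot_advance() */
    MyReplicationSlot->data.restart_lsn = moveto;        /* (1) under spinlock */
    ReplicationSlotMarkDirty();                          /* (2) */
    ReplicationSlotsComputeRequiredLSN();                /* (3) no save before this */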
However, it doesn't mean LogicalConfirmReceivedLocation() is safe. It
would be safe without concurrency, but it can happen that logical
decoding does (1) and maybe (2), and before the slot gets persisted,
some other session gets to call ReplicationSlotsComputeRequiredLSN().
It might be logical decoding on another slot, or an advance of a physical
slot. I haven't checked what else can trigger that.
So ultimately logical slots have exactly the same issue.
Attached are two patches demonstrating the issue. 0001 adds injection
points in two places - before (2) in LogicalConfirmReceivedLocation,
and before the removal of old WAL in a checkpoint. 0002 then adds a simple
TAP test triggering the issue in pg_logical_slot_get_changes(), leading to:
ERROR: requested WAL segment pg_wal/000000010000000000000001 has
already been removed
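Roughly where 0001 puts the injection points (just a sketch with made-up
point names - see the attached patch for the real thing):

    /* logical.c, LogicalConfirmReceivedLocation(), after updating
     * restart_lsn but before the slot is marked dirty and saved */
    INJECTION_POINT("logical-confirm-before-dirty");
    ReplicationSlotMarkDirty();

    /* xlog.c, CreateCheckPoint(), right before old segments are unlinked */
    INJECTION_POINT("checkpoint-before-old-wal-removal");
    RemoveOldXlogFiles(_logSegNo, ...);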
The same issue could be demonstrated on a physical slot - it would
actually be simpler, I think.
I've been unable to cause issues for streaming replication (both
physical and logical), because the subscriber sends startpoint which
adjusts the restart_lsn to a "good" value. But I'm not sure if that's
reliable in all cases, or if the replication could break too.
It's entirely possible this behavior is common knowledge, but it was a
surprise to me. Even if streaming replication is safe, it does seem
to make using the SQL functions less reliable (not that they don't have
other challenges, e.g. with Ctrl-C). But maybe it could be made safer?
I don't have a great idea how to improve this. It seems wrong for
ReplicationSlotsComputeRequiredLSN() to calculate the LSN using values
from dirty slots, so maybe it should simply retry if any slot is dirty?
Or retry on that one slot? But various places update the restart_lsn
before marking the slot as dirty, so right now this won't work.
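Something like this, say (a very rough sketch of the "retry / skip while
dirty" idea, inside the loop sketched earlier - and as mentioned, it can't
work as long as callers update restart_lsn before ReplicationSlotMarkDirty()):

    SpinLockAcquire(&s->mutex);
    restart_lsn = s->data.restart_lsn;
    dirty = s->dirty;                 /* not yet written to disk */
    SpinLockRelease(&s->mutex);

    if (dirty)
    {
        /* someone has an unsaved restart_lsn - don't trust it, retry later */
        LWLockRelease(ReplicationSlotControlLock);
        goto retry;
    }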
regards
--
Tomas Vondra
Attachments:
  0002-TAP-test.patch (text/x-patch, 6.5 KB)
  0001-injection-points.patch (text/x-patch, 2.6 KB)