From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-06 15:23:49
Message-ID: 85fff40e-148b-4e86-b921-b4b846289132@vondra.me
Lists: pgsql-hackers
Hi,
I've been investigating some issues reported by users, related to
logical replication unexpectedly breaking with messages like:
LOG: invalidating slot "s" because its restart_lsn X/Y exceeds
max_slot_wal_keep_size
which is pretty confusing, because the system has that GUC set to -1 (so
disabled, there should be no limit). Or a message like this:
ERROR: requested WAL segment 000...0AA has already been removed
which is a bit less confusing, but still puzzling because replication
slots are meant to prevent exactly this.
I speculated there's some sort of rare race condition in how we advance
the slot LSN values. I didn't know where to start, so I gave up on
trying to understand the whole complex code. Instead, I wrote a simple
stress test that
1) sets up a cluster (primary + 1+ logical replicas)
2) runs pgbench on the primary, checks replicas/slots
3) randomly restarts the nodes (both primary/replicas) in either fast
or immediate mode, with random short pauses
4) runs checkpoints in a tight while loop
This is based on the observation that in past reports the issues seem
to happen only with logical replication, shortly after a reconnect (e.g.
after a connection reset etc.). And the slots are invalidated / WAL
removed during a checkpoint, so frequent checkpoints should make it
easier to hit ...
Attached are a couple of scripts running this - they're not particularly
pretty and may need some tweaks to make them work on your machine (run.sh
is the script to run).
And unfortunately, this started to fail pretty quickly. The short story is
that it's not difficult to get into a situation where restart_lsn for a
slot moves backwards, so something like this can happen:
1) slot restart_lsn moves forward to LSN A
2) checkpoint happens, updates min required LSN for slots, recycles
segments it considers unnecessary (up to LSN A)
3) slot restart_lsn moves backwards to LSN B (where B < A)
4) kaboom
This would perfectly explain all the symptoms I mentioned earlier. The
max_slot_wal_keep_size reference is confusing, but AFAIK it's just based
on (reasonable) belief that LSN can't move backwards, and so the only
reason for restart_lsn being before min required LSN is that this GUC
kicked in. But the assumption does not hold :-(
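To make that "min required LSN" part concrete, here is a minimal standalone
sketch (a paraphrase, not the actual PostgreSQL source; the LSN values in
main() are made up) of the idea that WAL older than the minimum restart_lsn
over all slots may be recycled at checkpoint time:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;                /* stand-in for PostgreSQL's XLogRecPtr */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

static XLogRecPtr
min_required_restart_lsn(const XLogRecPtr *restart_lsns, int nslots)
{
    XLogRecPtr  min_lsn = InvalidXLogRecPtr;

    for (int i = 0; i < nslots; i++)
    {
        if (restart_lsns[i] == InvalidXLogRecPtr)
            continue;
        if (min_lsn == InvalidXLogRecPtr || restart_lsns[i] < min_lsn)
            min_lsn = restart_lsns[i];
    }

    /* the checkpointer considers WAL before this point unnecessary */
    return min_lsn;
}

int
main(void)
{
    /* hypothetical restart_lsn values for two slots */
    XLogRecPtr  restart_lsns[] = {0x1E979D4C8, 0x1EA000000};
    XLogRecPtr  min_lsn = min_required_restart_lsn(restart_lsns, 2);

    printf("WAL before %X/%X may be recycled\n",
           (unsigned) (min_lsn >> 32), (unsigned) min_lsn);
    return 0;
}

If a slot's restart_lsn later moves to a value before what this returned at
the last checkpoint, the WAL it needs may already be gone - and the
invalidation code then attributes that (incorrectly) to max_slot_wal_keep_size.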
Now, why/how can this happen?
I kept adding elog(LOG) messages to all places updating LSNs for a
slot, and asserts that fail if data.restart_lsn moves backwards. See the
attached 0001 patch. An example of a failure (for the walsender backend
that triggered the assert) looks like this:
344.139 LOG: starting logical decoding for slot "s1"
344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.
344.140 LOG: logical decoding found consistent point at 1/E96FB4D0
344.140 DETAIL: Logical decoding will begin using saved snapshot.
344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398
344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)
344.145 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E96FB4D0
344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)
344.147 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E979D4C8
344.156 LOG: LogicalIncreaseXminForSlot
candidate_catalog_xmin 30699
candidate_xmin_lsn 1/E993AD68 (0/0)
...
344.235 LOG: LogicalIncreaseRestartDecodingForSlot
restart_decoding_lsn 1/E9F33AF8
344.240 LOG: LogicalConfirmReceivedLocation 1/E9DCCD78
344.240 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E96FB4D0 (from 1/E979D4C8)
candidate_restart_valid 0/0 (from 1/E979D4C8)
candidate_restart_lsn 0/0 (from 1/E96FB4D0)
345.536 LOG: server process (PID 2518127) was terminated by
signal 6: Aborted
We start decoding at 1/E96FB4D0, and right after startup we get a
confirmation, and LogicalConfirmReceivedLocation updates restart_lsn to
1/E979D4C8.
But then LogicalIncreaseRestartDecodingForSlot comes along, and stores
the restart_decoding_lsn 1/E96FB4D0 (which is the initial restart_lsn)
into candidate_restart_lsn.
And then a little bit later we get another confirmation and call
LogicalConfirmReceivedLocation, which propagates candidate_restart_lsn
into data.restart_lsn.
This is how restart_lsn moves backwards, causing issues. I've reproduced
this on PG14 and current master, but I believe the issue exists since
the introduction of logical replication in 9.4 :-(
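To make the sequence easier to follow, here is a simplified standalone
simulation of that handshake. The field names match the slot fields discussed
above, but the two functions are only paraphrases of what
LogicalIncreaseRestartDecodingForSlot / LogicalConfirmReceivedLocation
effectively do here (the real code takes locks, persists the slot, and has
additional checks):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
    XLogRecPtr  restart_lsn;                /* data.restart_lsn */
    XLogRecPtr  candidate_restart_lsn;
    XLogRecPtr  candidate_restart_valid;
} SimSlot;

/* roughly: remember a candidate restart_lsn if none is pending */
static void
offer_candidate(SimSlot *slot, XLogRecPtr current_lsn, XLogRecPtr restart_lsn)
{
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;  /* may be older than restart_lsn! */
    }
}

/* roughly: on confirmation, promote the candidate into data.restart_lsn */
static void
confirm(SimSlot *slot, XLogRecPtr lsn)
{
    if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
        slot->candidate_restart_valid <= lsn)
    {
        slot->restart_lsn = slot->candidate_restart_lsn;   /* no backwards check */
        slot->candidate_restart_lsn = InvalidXLogRecPtr;
        slot->candidate_restart_valid = InvalidXLogRecPtr;
    }
}

int
main(void)
{
    /* decoding starts at 1/E96FB4D0, with a pending candidate 1/E979D4C8 */
    SimSlot     slot = {0x1E96FB4D0, 0x1E979D4C8, 0x1E9865398};

    confirm(&slot, 0x1E9865398);        /* restart_lsn moves forward to 1/E979D4C8 */

    /* the stale restart_decoding_lsn (the old restart_lsn) gets offered */
    offer_candidate(&slot, 0x1E979D4C8, 0x1E96FB4D0);

    confirm(&slot, 0x1E9DCCD78);        /* restart_lsn moves back to 1/E96FB4D0 */

    printf("restart_lsn = %X/%X\n",
           (unsigned) (slot.restart_lsn >> 32), (unsigned) slot.restart_lsn);
    return 0;
}

Running this prints restart_lsn = 1/E96FB4D0, i.e. the same backward move as
in the log above.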
I'm not claiming this is the only way this can happen, but all the
cases I've seen in my stress testing look like this. Moreover, I'm not
claiming this is the only LSN field that can move backwards like this.
It seems to me various other candidate_ fields have the same issue, but
may have consequences other than discarding "unnecessary" WAL.
I've been reminded of this [1] thread from 2022. I'm 99% sure it's the
same issue - it happened shortly after a reconnect, etc. And it seems to
me Masahiko-san was right about the causes in [2]. I don't think the fix
suggested in that message (changing the branch to "else if") can work,
though. At least it did not really help in my testing.
And I'm not sure it'd fix all the issues - it only affects restart_lsn,
but AFAICS the same issue (LSNs moving backwards) can happen for the
other LSN slot field (candidate_xmin_lsn).
I don't see how the backward moves could be valid for any of those
fields, nor for "propagating" the candidate values into restart_lsn.
But there's no protection against the backward moves - so either it's
considered to be OK (which seems incorrect), or it was not expected to
happen in practice.
The 0001 patch adds an assert preventing those backward moves on all the
fields. This means it fails with ABORT earlier, even before a checkpoint
gets a chance to invalidate the slot or remove the segment.
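For illustration, the added check is roughly of this shape (hypothetical -
the actual WIP patch may differ in the details):
/* hypothetical shape of the 0001-style check, before the update */
Assert(MyReplicationSlot->candidate_restart_lsn >=
       MyReplicationSlot->data.restart_lsn);
MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;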
The 0002 part replaces the asserts with elog(LOG), and instead tweaks the
updates to do Max(old,new) to prevent the backward moves. With this I'm
no longer able to reproduce the issue - and there's a lot of LOG
messages about the (prevented) backward moves.
Unfortunately, I'm not sure this is quite correct. Consider, for example,
this:
slot->candidate_catalog_xmin = xmin;
slot->candidate_xmin_lsn = Max(slot->candidate_xmin_lsn,
current_lsn);
I suspect this means the fields could get "out of sync". Not sure what
could break because of this.
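As a tiny standalone illustration of what I mean by "out of sync"
(hypothetical values, just to show the concern):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

#define Max(a,b) ((a) > (b) ? (a) : (b))

int
main(void)
{
    /* candidate pair recorded earlier, both taken from the same record */
    TransactionId candidate_catalog_xmin = 30700;
    XLogRecPtr    candidate_xmin_lsn = 0x1E9900000;     /* 1/E9900000 (made up) */

    /* a later call arrives with a different xmin but an older current_lsn */
    TransactionId xmin = 30710;
    XLogRecPtr    current_lsn = 0x1E9700000;            /* 1/E9700000 (made up) */

    candidate_catalog_xmin = xmin;
    candidate_xmin_lsn = Max(candidate_xmin_lsn, current_lsn);

    /* the xmin now comes from one record, the LSN from another */
    printf("catalog_xmin %u paired with xmin_lsn %X/%X\n",
           (unsigned) candidate_catalog_xmin,
           (unsigned) (candidate_xmin_lsn >> 32),
           (unsigned) candidate_xmin_lsn);
    return 0;
}

The xmin from the later call ends up paired with the LSN remembered from the
earlier one, so the two fields no longer describe a single point in the
stream - whether anything relies on that pairing is what I'm not sure about.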
I have to admit the "protocol" for the candidate fields (and how the
values propagate) is not very clear to me. And AFAICS it's not really
described/explained anywhere :-(
Note: While working on this, I realized PG14 and PG15 need the fix
eb27d3dc8887, which was backpatched only to 16+. But I hit that on 14
too during testing. I already pinged Daniel about this, but you may need
to cherry-pick that commit before testing until he has time for it.
regards
[1] https://www.postgresql.org/message-id/Yz2hivgyjS1RfMKs%40depesz.com
--
Tomas Vondra
Attachments:
  logrep-script.tgz (application/x-compressed-tar, 2.3 KB)
  vmaster-0001-WIP-add-elog-messages-and-asserts.patch (text/x-patch, 13.9 KB)
  vmaster-0002-WIP-change-asserts-to-elog-defense.patch (text/x-patch, 5.8 KB)
  v14-0001-WIP-add-elog-messages-and-asserts.patch (text/x-patch, 13.9 KB)
  v14-0002-WIP-change-asserts-to-elog-defense.patch (text/x-patch, 5.8 KB)