pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

From: Nikolay Samokhvalov <nik(at)postgres(dot)ai>
To: pgsql-hackers(at)postgresql(dot)org
Subject: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
Date: 2023-06-29 17:50:12
Message-ID: CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

(posting this to -hackers rather than to -docs since it seems a deeper
problem than just adjusting the docs)

I recently observed a case with standby corruption after upgrading pg12 to
pg14, which was presented in the form of XX001 errors on the new cluster's
standby nodes. e.g.:
ERROR: missing chunk number 0 for toast value 3228893903 in
pg_toast_79504413

Comparing the content of the data directory and checking files with md5sum,
I noticed that some files for some TOAST index have different content on
new standby nodes compared to the new primary – and interesting was the
fact all standbys had the same content. Just different compared to the
primary.

We used the "rsync --size-only" snippet from the docs
https://www.postgresql.org/docs/current/pgupgrade.html to upgrade standbys.

With "--size-only", 1 GiB files for tables and indexes obviously cannot be
reliably synchronized. In our case, we perform additional steps involving
logical replication, advancing primary to certain LSN position -- and
during that, we keep standbys down. This explains the increased corruption
risks. But I think these risks are present for those who just follow the
steps in the docs as is, and probably some fixes or improvements are needed
here.

The main question: why do we consider "rsync --size-only" as reliable in
the general case? May standby corruption happen if we use follow steps from
https://www.postgresql.org/docs/current/pgupgrade.html?

Considering several general situations:
1. For streaming replication:
a. if we shut down the primary first, based on the code in walsender.c
defining how shutdown even is handled, replicas should receive all the
changes?
b. if shut down standbys first (might be preferred if we run cluster
under Patroni control, to avoid unnecessary failover), then some changes
from the primary won't be received by standbys – and we do have standby
corruption risks
2. For replication based on WAL shipping, I don't think we can guarantee
that all changes are propagated to standbys.

The docs also have this:

> 9. Prepare for standby server upgrades
> If you are upgrading standby servers using methods outlined in section
Step 11, verify that the old standby servers are caught up by running
pg_controldata against the old primary and standby clusters. Verify that
the “Latest checkpoint location” values match in all clusters. (There will
be a mismatch if old standby servers were shut down before the old primary
or if the old standby servers are still running.) Also, make sure wal_level
is not set to minimal in the postgresql.conf file on the new primary
cluster.

– admitting that there might be mismatch. But if there is mismatch, rsync
--size-only is not going to help synchronize properly, right?

I was thinking about how to improve here, some ideas:
- "rsync --checksum" doesn't seem to be a good idea, it's, unfortunately,
very, very slow, though it would be the most reliable approach (but since
it's slow, I guess it's not worth even mentioning, crossing this out)
- we could remove "--size-only" and rely on default rsync behavior –
checking size and modification time; but how reliable would it be in
general case?
- make the step verifying “Latest checkpoint location” *after* shutting
down all nodes as mandatory, with instructions on how to avoid mismatch:
e.g., shut down primary first, disabling automated failover software, if
any, then run pg_controldata on standbys while they are running, and on
primary while it's already shut down (probably, different instructions are
needed for WAL shipping and streaming cases)
- probably, we should always run "rsync --checksum" for pg_wal
- I think, it's time to provide a snippet to run "rsync" in multiple
threads. A lot of installations today have many vCPUs and fast SSDs, and
running single-threaded rsync seems to be very slow (especially if we do
need to move away from "--size-only"). If it makes sense, I could come up
with some patch proposal for the docs
- it's probably time to implement support for standby upgrade in
pg_upgrade itself, finding some way to take care of standbys and moving
away from the need to run rsync or to rebuild standby nodes? Although, this
is just a raw idea without a proper proposal yet.

Does this make sense or I'm missing something and the current docs describe
a reliable process? (As I said, we have deviated from the process, to
involve logical replication, so I'm not 100% sure I'm right suspecting the
original procedure in having standby corruption risks.)

Thanks,
Nikolay Samokhvalov
Founder, Postgres.ai

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tristan Partin 2023-06-29 18:34:42 Re: Meson build updates
Previous Message Andres Freund 2023-06-29 17:35:32 Re: Meson build updates