Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-10 17:35:56
Message-ID: b2641e15-3230-4870-9162-8d6f8df7c8ef@vondra.me
Lists: pgsql-hackers

Hi,

I continued stress testing this, as I was rather unsure why the assert
failures reported in [1] disappeared. I managed to reproduce them again,
and I think I now understand why it happens.

I modified the test script (attached) to set up replication, not just a
single instance. It then does a bit of work, flips the checksums,
restarts the instances (randomly, in fast/immediate mode), verifies the
checksums, and so on. With that I can hit this assert in
AbsorbChecksumsOnBarrier() pretty easily:

Assert(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The reason is pretty simple - this happens on the standby:

1) the standby receives XLOG_CHECKSUMS and applies the transition from 2
to 1 (i.e. it sets ControlFile->data_checksum_version from
"inprogress-on" to "on"), and signals all other processes to refresh
LocalDataChecksumVersion

2) the control file gets written to disk for whatever reason (redo does
this in a number of places)

3) the standby gets restarted in "immediate" mode (I'm not sure if this
can happen with "fast" mode; I only recall seeing "immediate")

4) the standby receives the XLOG_CHECKSUMS record *again*, updates
ControlFile->data_checksum_version (to the same value, so effectively a
no-op), and then signals the other processes again

5) the other processes already have LocalDataChecksumVersion=1 (on), but
the assert says it should be 2 (inprogress-on) => kaboom
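
To make the failed expectation concrete, here is a rough sketch of the
barrier handler as I understand it. Only the Assert is the actual code
quoted above; the signature, constants and the final assignment are my
reconstruction of what the patch presumably does:

static bool
AbsorbChecksumsOnBarrier(void)
{
    /*
     * The handler assumes the backend still considers checksums to be
     * in the "inprogress-on" state (2) when the barrier arrives ...
     */
    Assert(LocalDataChecksumVersion ==
           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

    /*
     * ... but after the restart in step (3), every process initialized
     * LocalDataChecksumVersion from the already-flushed control file,
     * i.e. to "on" (1). When step (4) replays XLOG_CHECKSUMS again and
     * re-emits the barrier, the Assert above fires.
     */
    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;

    return true;
}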

I believe this can happen for changes in either direction, although the
window while disabling checksums is narrower.

I'm not sure what to do about this. Maybe we could relax the assert in
some way? But that seems a bit ... possibly risky. It's not necessarily
true that we'll see the immediately preceding checksum state; we might
see a state from a couple of updates back (if the control file was not
updated in between).
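
For illustration, the most obvious relaxation would be something like
the following (hand-wavy, and note it does nothing about the "couple of
updates back" problem):

    /* also accept the target state, in case the barrier is redundant */
    Assert(LocalDataChecksumVersion ==
           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
           LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION);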

Could this affect checksum verification during recovery? Imagine we get
to the "on" state, the controlfile gets flushed, and then the standby
restarts and starts receiving older records again. The control file says
we should be verifying checksums, but couldn't some of the writes have
been lost (and so the pages may not have a valid checksum)?

The one idea I have is to create an "immediate" restartpoint in
xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
a "spread" restartpoint would not be enough, because then we could get
into the same situation, with the control file out of sync (ahead of
WAL) after a restart. It would not be cheap, but it should be a rare
operation ...
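
Roughly, I mean something like this in the XLOG_CHECKSUMS branch of
xlog_redo(). RequestCheckpoint() and its flags are the existing
checkpoint-request API (during recovery the checkpointer performs a
restartpoint instead), but whether it's OK for the startup process to
block here, and whether a restartpoint can actually be established at
this point, is exactly what I haven't verified:

        case XLOG_CHECKSUMS:
            {
                /*
                 * ... existing handling: update (and flush) the control
                 * file, emit the procsignal barrier, etc. ...
                 */

                /*
                 * Force an immediate restartpoint and wait for it, so
                 * the flushed control file can't get ahead of the WAL
                 * we would replay again after a restart.
                 */
                RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                                  CHECKPOINT_WAIT);
                break;
            }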

I was wondering if the primary has the same issue, but AFAICS it does
not. It flushes the control file in only a couple of places, and I
couldn't think of a way to get it out of sync.

regards

[1]
https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me

--
Tomas Vondra

Attachment Content-Type Size
test.sh application/x-shellscript 4.5 KB
