From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-12 13:16:41
Message-ID: 9591d5d0-1fb1-4b43-bf09-501b6df5d227@vondra.me
Lists: pgsql-hackers
On 3/10/25 18:35, Tomas Vondra wrote:
> Hi,
>
> I continued stress testing this, as I was rather unsure why the assert
> failures reported in [1] disappeared. And I managed to reproduce that
> again, and I think I actually understand why it happens.
>
> I modified the test script (attached) to set up replication, not just a
> single instance. And then it does a bit of work, flips the checksums,
> restarts the instances (randomly, fast/immediate), verifies the checksums
> and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
> pretty easily:
>
> Assert(LocalDataChecksumVersion ==
> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>
> The reason is pretty simple - this happens on the standby:
>
> 1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
> sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
> and signals all other processes to refresh LocalDataChecksumVersion
>
> 2) the control file gets written to disk for whatever reason (redo does
> this in a number of places)
>
> 3) standby gets restarted with "immediate" mode (I'm not sure if this
> can happen with "fast" mode, I only recall seeing "immediate")
>
> 4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
> ControlFile->data_checksum_version (to the same value, so a no-op), and
> then signals the other processes again
>
> 5) the other processes already have LocalDataChecksumVersion=1 (on), but
> the assert says it should be 2 (inprogress-on) => kaboom
>
> I believe this can happen for changes in either direction, although the
> window while disabling checksums is more narrow.
>
>
> I'm not sure what to do about this. Maybe we could relax the assert in
> some way? But that seems a bit ... possibly risky. It's not necessarily
> true we'll see the immediately preceding checksum state, we might see a
> couple updates back (if the control file was not updated in between).
>
> Could this affect checksum verification during recovery? Imagine we get
> to the "on" state, the controlfile gets flushed, and then the standby
> restarts and starts receiving older records again. The control file says
> we should be verifying checksums, but couldn't some of the writes have
> been lost (and so the pages may not have a valid checksum)?
>
> The one idea I have is to create an "immediate" restartpoint in
> xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
> a "spread" restartpoint would not be enough, because then we could get
> into the same situation with a control file out of sync (ahead of WAL) after
> a restart. It'd not be cheap, but it should be a rare operation ...
>
> I was wondering if the primary has the same issue, but AFAICS it does
> not. It flushes the control file in only a couple places, I couldn't
> think of a way to get it out of sync.
>
I continued investigating this and experimenting with alternative
approaches, and I think the way the patch relies on ControlFile is not
quite right. That is, it always sets data_checksum_version to the last
("current") value, but that's not what ControlFile is for ...
The ControlFile is meant to hold a safe/consistent state, e.g. for crash
recovery. Setting data_checksum_version to the "last" value we've seen
breaks that - if the control file gets persisted (I haven't seen this on
the primary, but it's pretty common on a replica, per the report), the
recovery will start with a "future" data_checksum_version value. Which
is wrong - we'll replay the XLOG_CHECKSUMS record again, triggering the
assert. I suspect it might also lead to confusion about whether checksums
should be verified or not.
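
For illustration, the problematic pattern in the redo path looks roughly
like this (a simplified sketch, not the actual patch code - the record
struct, its field and the signalling details are assumptions):

    case XLOG_CHECKSUMS:
        {
            /* new state carried by the WAL record (struct name assumed) */
            xl_checksum_state xlrec;

            memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_checksum_state));

            /*
             * The *current* value goes straight into pg_control, so any
             * later flush of the control file persists a state that is
             * ahead of the last checkpoint/restartpoint.
             */
            LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
            ControlFile->data_checksum_version = xlrec.new_checksumtype;
            UpdateControlFile();
            LWLockRelease(ControlFileLock);

            /* ... then signal all processes to refresh their local copy */
        }
        break;
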
In my earlier message I suggested maybe this could be solved by forcing
a checkpoint every time we see the XLOG_CHECKSUMS record (or rather a
restartpoint, as it'd be on the replica). Sure, that would have some
undesirable consequences (forcing an immediate checkpoint is not cheap,
and the redo would need to wait for that). But the assumption was it'd
be very rare (how often do you enable checksums?), so this cost might be
acceptable.
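
For reference, the mechanism would have boiled down to a single call in
the redo path, roughly like this (just a sketch - the flags exist today,
the placement is the assumption):

    /*
     * Ask the checkpointer for an immediate checkpoint (a restartpoint,
     * while in recovery) and wait for it, so pg_control can't get flushed
     * "ahead" of the point where recovery would restart.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
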
But when I started experimenting with this, I realized it has a couple
other issues:
1) We can't do the checkpoint/restartpoint when handling XLOG_CHECKSUMS,
because a restartpoint can only be established at a checkpoint record
we've already replayed - so after a restart we'd see this XLOG record
again, which we don't want. So the checkpoint would need to happen the
*next* time we update the control file.
2) But we can't trigger a checkpoint from UpdateControlFile, because of
locking (because CreateCheckPoint also calls UpdateControlFile). So this
would require much more invasive changes to all places updating the
control file.
3) It does not resolve the mismatch caused by using ControlFile to store
the "current" data_checksum_version value.
4) ... probably more minor issues that I already forgot about.
In the end, I decided to try to rework this by storing the current value
elsewhere, and only updating the "persistent" value in the control file
when necessary.
XLogCtl seemed like a good place, so I used that - after all, it's a
value from XLOG. Maybe there's a better place? I'm open to suggestions,
but it does not really affect the overall approach.
So all the places now update XLogCtl->data_checksum_version instead of
the ControlFile, and also query that field for the *current* value.
The value is copied from XLogCtl to ControlFile when creating a checkpoint
(or restartpoint), and the control file is persisted. This means (a) the
current value can't get written to the control file prematurely, and (b)
the value is consistent with the checkpoint (i.e. with the LSN where we
start crash recovery, if needed).
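
In code form, the change boils down to something like this (a minimal
sketch, assuming the field lives in XLogCtl and is protected by the
existing info_lck spinlock - the actual patch may use different names and
locking, and the ControlFileLock handling around pg_control is elided):

    /*
     * Whenever the state changes (SQL command, or redo of XLOG_CHECKSUMS):
     * update only the shared "current" value, leave pg_control alone.
     */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = new_version;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * In CreateCheckPoint() / CreateRestartPoint(), right before writing
     * the control file: persist the value that is consistent with the
     * checkpoint's redo LSN.
     */
    SpinLockAcquire(&XLogCtl->info_lck);
    ControlFile->data_checksum_version = XLogCtl->data_checksum_version;
    SpinLockRelease(&XLogCtl->info_lck);
    UpdateControlFile();
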
The attached 0005 patch implements this. It's a bit WIP and I'm sure it
can be improved, but I have yet to see a single crash/failure with it.
With the original patch I saw crashes after 5-10 loops (i.e. a couple of
minutes); I'm now at loop 1000 and it's still OK.
I believe the approach is correct, but the number of possible states
(e.g. after a crash/restart) makes this a bit complex. I wonder if
there's a better way to handle this, but I can't think of any. Ideas?
One issue I ran into is that the postmaster does not seem to process the
barriers, and thus does not learn about data_checksum_version changes.
That's fine until it needs to launch a child process (e.g. a walreceiver),
which will then see the LocalDataChecksumVersion as of the start of the
instance, not the "current" one. I fixed this by explicitly refreshing
the value in postmaster_child_launch(), but maybe I'm missing something.
(EXEC_BACKEND may need to handle this too.)
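
The fix is essentially a one-line refresh before forking the child,
something like this (the helper name is hypothetical, not necessarily
what the patch does - it would have to live in xlog.c, where XLogCtl is
visible):

    /*
     * Refresh the process-local copy from shared memory, so a child forked
     * by the postmaster starts with the current checksum state rather than
     * the value the postmaster cached at startup. Called from
     * postmaster_child_launch() before the fork.
     */
    void
    RefreshLocalDataChecksumVersion(void)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        LocalDataChecksumVersion = XLogCtl->data_checksum_version;
        SpinLockRelease(&XLogCtl->info_lck);
    }
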
regards
--
Tomas Vondra
Attachments:
  v20250312-0001-Online-enabling-and-disabling-of-data-chec.patch (text/x-patch, 145.2 KB)
  v20250312-0002-review-fixes.patch (text/x-patch, 6.5 KB)
  v20250312-0003-pgindent.patch (text/x-patch, 1.9 KB)
  v20250312-0004-perltidy.patch (text/x-patch, 1.8 KB)
  v20250312-0005-data_checksum_version-reworks.patch (text/x-patch, 12.2 KB)
  v20250312-0006-debug.patch (text/x-patch, 9.2 KB)
  checksum-test.sh (application/x-shellscript, 5.9 KB)