Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-13 11:03:29
Message-ID: df809690-2297-469a-a6df-21aa8394b47a@vondra.me
Lists: pgsql-hackers

On 3/13/25 10:54, Daniel Gustafsson wrote:
>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>> I continued investigating this and experimenting with alternative
>> approaches, and I think the way the patch relies on ControlFile is not
>> quite right. That is, it always sets data_checksum_version to the last
>> ("current") value, but that's not what ControlFile is for ...
>
> Agreed, that's a thinko on my part. Reading it makes it clear, but I had
> failed to see that when hacking =/
>

It wasn't obvious to me either, until I managed to trigger the failure
and investigated the root cause.

>> XLogCtl seemed like a good place, so I used that - after all, it's a
>> value from XLOG. Maybe there's a better place? I'm open to suggestions,
>> but it does not really affect the overall approach.
>
> Seems like a good place for it.
>

OK

>> So all the places now update XLogCtl->data_checksums_version instead of
>> the ControlFile, and also query this flag for *current* value.
>>
>> The value is copied from XLogCtl to ControlFile when creating checkpoint
>> (or restartpoint), and the control file is persisted. This means (a) the
>> current value can't get written to the control file prematurely, and (b)
>> the value is consistent with the checkpoint (i.e. with the LSN where we
>> start crash recovery, if needed).
>
> +1
>

OK. I still want to go over the places once more and double-check that each
of them sets the ControlFile value to the right data_checksum_version.
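
To spell out the mechanism, the checkpoint-time copy boils down to something
like this (a simplified sketch, not the exact patch code):

    /*
     * Copy the current value from shared memory into the control file
     * while creating a checkpoint/restartpoint, so the on-disk value is
     * always consistent with the checkpoint's redo LSN.
     */
    SpinLockAcquire(&XLogCtl->info_lck);
    ControlFile->data_checksum_version = XLogCtl->data_checksum_version;
    SpinLockRelease(&XLogCtl->info_lck);

    UpdateControlFile();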

>> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
>> can be improved, but I'm yet to see a single crash/failure with it. With
>> the original patch I've seen crashes after 5-10 loops (i.e. a couple
>> minutes), I'm now at loop 1000 and it's still OK.
>
> Given how successful this test has been at stressing out errors, that is
> indeed comforting to hear.
>

It is. I plan to vary the stress test a bit more, and also run it on
another machine (rpi5, to get some non-x86 testing).

>> I believe the approach is correct, but the number of possible states
>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>> better way to handle this, but I can't think of any. Ideas?
>
> Not sure if this moves the needle too far in terms of complexity wrt the
> previous version of the patch; there were already multiple copies.
>

It does add one more place (XLogCtl->data_checksum_version) that stores the
current state, so it's not that much more complex, of course. But I was not
really comparing this to the previous patch version; I meant the state space
in general - all the possible combinations of the flags (control file, local,
XLogCtl).

I wonder if it might be possible to have a more thorough validation of the
transitions. We already have that for LocalDataChecksumVersion, thanks to the
asserts - and it was damn useful, otherwise I think we would not have noticed
this issue for a long time.

I wonder if we can have similar checks for the other flags. I'm pretty
sure we can have the same checks for XLogCtl, right? I'm not quite sure
about ControlFile - can't that "skip" some of the changes, e.g. if we do
(enable->disable->enable) within a single checkpoint? Need to check.
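
Something along these lines is what I have in mind for XLogCtl (just a
sketch - the helper name is made up, and the exact set of allowed transitions
depends on the patch's state machine):

    static void
    SetXLogCtlDataChecksumVersion(uint32 new_version)
    {
        SpinLockAcquire(&XLogCtl->info_lck);

        /* allow only the expected state transitions */
        switch (new_version)
        {
            case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
                Assert(XLogCtl->data_checksum_version == 0);
                break;
            case PG_DATA_CHECKSUM_VERSION:
                Assert(XLogCtl->data_checksum_version ==
                       PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
                break;
            case 0:
                Assert(XLogCtl->data_checksum_version != 0);
                break;
        }

        XLogCtl->data_checksum_version = new_version;
        SpinLockRelease(&XLogCtl->info_lck);
    }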

This also reminds me I had a question about the barrier - can't it happen
that a process ends up processing multiple barriers at the same time? I mean,
let's say it gets stuck for a while, and the cluster happens to go through
disable+enable in the meantime. Won't it then see both barriers? That'd be a
problem, because the core processes the barriers in the order determined by
the enum value, not in the order the barriers were emitted. Which means it
might break the expected state transitions again (and end up with the wrong
local value). I haven't tried it, though.
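
To illustrate the concern (simplified from memory, not the actual
procsignal.c code, and the checksum barrier names are placeholders): if both
barriers are pending when the process finally absorbs them, they get handled
in enum-bit order, not in the order they were emitted:

    flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

    while (flags != 0)
    {
        /* lowest enum value first, regardless of emission order */
        type = pg_rightmost_one_pos32(flags);
        flags &= ~(1 << type);

        switch (type)
        {
            case PROCSIGNAL_BARRIER_CHECKSUM_OFF:   /* placeholder */
                ProcessBarrierChecksumOff();
                break;
            case PROCSIGNAL_BARRIER_CHECKSUM_ON:    /* placeholder */
                ProcessBarrierChecksumOn();
                break;
        }
    }

So if the "on" barrier happens to sort before the "off" one, a process that
slept through disable+enable would apply "off" last, and end up with checksums
locally disabled even though the cluster finished with them enabled.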

>> One issue I ran into is the postmaster does not seem to be processing
>> the barriers, and thus not getting info about the data_checksum_version
>> changes.
>
> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>

Not sure I follow. What's a reasonable constraint?

>> That's fine until it needs to launch a child process (e.g. a
>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>> start of the instance, not the "current" one. I fixed this by explicitly
>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>> something. (Also, EXEC_BACKEND may need to handle this too.)
>
> The pg_checksums test is failing for me on this version due to the GUC not
> being initialized; don't we need something like the below as well? (With a
> comment explaining why ReadControlFile wasn't enough.)
>
> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
> Assert(reset || ControlFile == NULL);
> ControlFile = palloc(sizeof(ControlFileData));
> ReadControlFile();
> + SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>

Yeah, I think this (or something like it) is missing.
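
For the record, the postmaster side of my fix is essentially just this in
postmaster_child_launch(), before the child is forked (a sketch, the exact
call may differ):

    /*
     * The postmaster never absorbs the barrier, so refresh the local
     * value from shared memory right before launching the child.
     */
    SetLocalDataChecksumVersion(GetCurrentDataChecksumVersion());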

> A few comments on the patchset:
>
> + * Local state fror Controlfile data_checksum_version. After initialization
> s/fror/for/. Also, this is no longer true as it's a local copy of the XlogCtl
> value and not the Controlfile value (which may or may not be equal).
>
>
> - if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> + if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> ereport(WARNING,
> (errmsg("data checksums are being enabled, but no worker is running"),
> errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
> Reading this made me realize what a terrible error message I had placed there;
> the hint is good, but the message says checksums are being enabled when they're
> not being enabled. Maybe "data checksums are marked as being in-progress, but
> no worker is running"
>

Makes sense, will reword.

>
> +uint32
> +GetLocalDataChecksumVersion(void)
> +{
> + return LocalDataChecksumVersion;
> +}
> +
> +/*
> + * Get the *current* data_checksum_version (might not be written to control
> + * file yet).
> + */
> +uint32
> +GetCurrentDataChecksumVersion(void)
> +{
> + return XLogCtl->data_checksum_version;
> +}
> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
> it from the Current value, and also to make it appear less like actual copies of
> controlfile values like LocalMinRecoveryPoint. Another thought is whether we
> should have the GetLocalDataChecksumVersion() API at all?
> GetCurrentDataChecksumVersion() should be a better API, no?
>

FWIW those functions are for debug logging only - I needed to print the
values in a couple of places outside xlog.c. I don't intend to make that part
of the patch.
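
Just for illustration, it was nothing more than ad-hoc stuff like this
(not meant to stay in the patch):

    elog(LOG, "data_checksum_version: local %u, current %u",
         GetLocalDataChecksumVersion(),
         GetCurrentDataChecksumVersion());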

regards

--
Tomas Vondra
