Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-13 16:26:47
Message-ID: 854de38e-8899-4095-8245-6531d86060c9@vondra.me
Lists: pgsql-hackers

On 3/13/25 13:32, Daniel Gustafsson wrote:
>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> On 3/13/25 10:54, Daniel Gustafsson wrote:
>>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>>>> I believe the approach is correct, but the number of possible states
>>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>>>> better way to handle this, but I can't think of any. Ideas?
>>>
>>> Not sure if this moves the needle too far in terms of complexity wrt the
>>> previous version of the patch, there were already multiple copies.
>>
>> It does add one more place (XLogCtl->data_checksum_version) to store the
>> current state, so it's not that much more complex, of course. But I was
>> not really comparing this to the previous patch version, I meant the
>> state space in general - all possible combinations of all the flags
>> (control file, local + XLogCtl).
>
> Fair point.
>
>> I wonder if it might be possible to have a more thorough validation of
>> the transitions. We already have that for the LocalDataChecksumVersion,
>> thanks to the asserts - and it was damn useful, otherwise we would not
>> have noticed this issue for a long time, I think.
>>
>> I wonder if we can have similar checks for the other flags. I'm pretty
>> sure we can have the same checks for XLogCtl, right?
>
> I don't see why not, they should abide by the same rules.
>

OK, I'll add these asserts.

>> I'm not quite sure
>> about ControlFile - can't that "skip" some of the changes, e.g. if we do
>> (enable->disable->enable) within a single checkpoint? Need to check.
>
> For enable->disable->enable within a single checkpoint then ControlFile should
> never see the disable state.
>

Hmm, that means we can't have the same checks for the ControlFile
fields, but I don't think that's a problem. We've verified the "path" to
that (on the XLogCtl field), so that seems fine.

>> This also reminds me I had a question about the barrier - can't it
>> happen a process gets to process multiple barriers at the same time? I
>> mean, let's say it gets stuck for a while, and the cluster happens to go
>> through disable+enable. Won't it then see both barriers? That'd be a
>> problem, because the core processes the barriers in the order determined
>> by the enum value, not in the order the barriers happened. Which means
>> it might break the expected state transitions again (and end with the
>> wrong local value). I haven't tried, though.
>
> Interesting, that seems like a general deficiency in the barriers, surely
> processing them in-order would be more intuitive? That would probably require
> some form of Lamport clock though.
>

Yeah, that seems non-trivial. What if we instead ensured there can't be
two barriers set at the same time? Say, if we (somehow) ensured all
processes saw the previous barrier before allowing a new one, we would
not have this issue, right?

But I don't know what would be a good way to ensure this. Is there a way
to check if all processes saw the barrier? Any ideas?
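FWIW procsignal.c does expose a wait primitive for this: EmitProcSignalBarrier() returns a generation number, and WaitForProcSignalBarrier() blocks until every attached process has absorbed that generation. A sketch of serializing the barriers that way (PROCSIGNAL_BARRIER_CHECKSUM is an assumed barrier-type name, not necessarily what the patch uses; this fragment is not standalone, it runs inside the backend executing the state change):

```c
uint64		gen;

/* Emit the barrier and remember its generation. */
gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM);

/*
 * Block until every running process has absorbed the barrier, and only
 * then allow the next state change - so two checksum barriers are never
 * in flight at the same time.
 */
WaitForProcSignalBarrier(gen);
```

Whether it's acceptable to block the checksum state change on every backend absorbing the barrier is a separate question, of course.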

>>>> One issue I ran into is the postmaster does not seem to be processing
>>>> the barriers, and thus not getting info about the data_checksum_version
>>>> changes.
>>>
>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>
>> Not sure I follow. What's a reasonable constraint?
>
> That the postmaster doesn't process them.
>

OK, that means we need a way to "refresh" the value for new child
processes, similar to what my patch does. But I suspect there might be a
race condition - if the child process starts while we're processing the
XLOG_CHECKSUMS record, it might happen to get the new value and then also
the barrier (if it does the "refresh" in between the XLogCtl update and
the barrier). Doesn't this need some sort of interlock to prevent that?

The child startup would need to do this:

1) acquire lock
2) reset barriers
3) refresh the LocalDataChecksumVersion (from XLogCtl)
4) release lock

while the walreceiver would do this

1) acquire lock
2) update XLogCtl value
3) emit barrier
4) release lock

Or is there a reason why this would be unnecessary?

>>>> That's fine until it needs to launch a child process (e.g. a
>>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>>>> start of the instance, not the "current" one. I fixed this by explicitly
>>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>>>> something. (Also, EXEC_BACKEND may need to handle this too.)
>>>
>>> The pg_checksums test is failing for me on this version due to the GUC not
>>> being initialized, don't we need something like the below as well? (With a
>>> comment explaining why ReadControlFile wasn't enough.)
>>>
>>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>>> Assert(reset || ControlFile == NULL);
>>> ControlFile = palloc(sizeof(ControlFileData));
>>> ReadControlFile();
>>> + SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>>>
>>
>> Yeah, I think this (or something like it) is missing.
>
> Thanks for confirming.
>
>>> +uint32
>>> +GetLocalDataChecksumVersion(void)
>>> +{
>>> + return LocalDataChecksumVersion;
>>> +}
>>> +
>>> +/*
>>> + * Get the *current* data_checksum_version (might not be written to control
>>> + * file yet).
>>> + */
>>> +uint32
>>> +GetCurrentDataChecksumVersion(void)
>>> +{
>>> + return XLogCtl->data_checksum_version;
>>> +}
>>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
>>> it from the Current value, and also to make appear less like actual copies of
>>> controlfile values like LocalMinRecoveryPoint. Another thought is if we should
>>> have the GetLocalDataChecksumVersion() API? GetCurrentDataChecksumVersion()
>>> should be a better API no?
>>>
>>
>> FWIW those functions are for debug logging only, I needed to print the
>> values in a couple places outside xlog.c. I don't intend to make that
>> part of the patch.
>
> Ah, gotcha, I never applied the debug patch from the patchset so I figured
> this was a planned API. The main question still stands, though: could
> LocalDataCheckXX be confusing, and would CachedDataCheckXX be better to
> distinguish it from actual controlfile copies?
>

Yeah, I'll think about the naming.

regards

--
Tomas Vondra
