From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-15 15:50:02
Message-ID: f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me
Lists: pgsql-hackers
On 3/14/25 15:06, Daniel Gustafsson wrote:
>> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>>> This is "ephemeral" in the sense that setting the value to "on" again
>>> would be harmless, and indeed a non-assert build will run just fine.
>>
>> As mentioned off-list, being able to loosen the restriction for the first
>> barrier seen seems like a good way to keep this assertion. Removing it is of
>> course the alternative solution, as it's not causing any issues, but given how
>> handy it's been to find actual issues it would be good to be able to keep it.
>>
>>> i.e. to first register into procsignal, and then read the new value.
>>> AFAICS this guarantees we won't lose any checksum version updates. It
>>> does mean we still can get a barrier for a value we've already seen, but
>>> I think we should simply ignore this for the very first update.
>>
>> Calling functions with side effects in setting state seems like a bad idea
>> before ProcSignalInit has run; that's a thinko on my part in this patch. Your
>> solution of reordering seems like the right way to handle this.
>
> 0006 in the attached version is what I have used when testing the above, along
> with an update to the copyright year which I had missed doing earlier. It also
> contains the fix in LocalProcessControlFile which I had in my local tree, I
> think we need something like that at least.
>
Thanks, here's an updated patch version - I squashed all the earlier
parts, but I kept your changes and my adjustments separate, for clarity.
A couple comments:
1) I don't think the comment before InitialDataChecksumTransition was
entirely accurate, because it said we can see the duplicate state only
for the "on" state. But AFAICS we can see duplicate values for any state,
except that we only have an assert for "on", so we don't notice the
other cases. I wonder if we could strengthen this a bit, by adding some
asserts for the other states too (a rough sketch of what I mean follows
these comments).
2) I admit it's rather subjective, but I didn't like how you did the
assert in AbsorbChecksumsOnBarrier. But looking at it now in the diff,
maybe it was more readable ...
3) I renamed InitLocalControldata() to InitLocalDataChecksumVersion().
The old name was entirely misleading, because the function now initializes
the flag in XLogCtl; it has nothing to do with the control file.
4) I realized AuxiliaryProcessMainCommon() may be a better place to
initialize the checksum flag for non-backend processes. In fact, doing
it in postmaster_child_launch() had the same race condition, because it
happened before ProcSignalInit() (the ordering is also sketched below).
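
For (1), roughly what I have in mind. The state names and the transition
table are placeholders, not the actual macros/values from the patch; the
point is only that every state could assert its legal predecessors while
still tolerating a duplicate delivery of the current state:

    #include <assert.h>

    /* hypothetical state names, standing in for data_checksum_version values */
    typedef enum
    {
        CHECKSUMS_OFF,
        CHECKSUMS_INPROGRESS_ON,
        CHECKSUMS_ON,
        CHECKSUMS_INPROGRESS_OFF
    } ChecksumState;

    static void
    AssertChecksumTransition(ChecksumState old_state, ChecksumState new_state)
    {
        /* a repeated barrier for the state we already have is harmless */
        if (old_state == new_state)
            return;

        /* placeholder transition table, one assert per target state */
        switch (new_state)
        {
            case CHECKSUMS_INPROGRESS_ON:
                assert(old_state == CHECKSUMS_OFF);
                break;
            case CHECKSUMS_ON:
                assert(old_state == CHECKSUMS_INPROGRESS_ON);
                break;
            case CHECKSUMS_INPROGRESS_OFF:
                assert(old_state == CHECKSUMS_ON);
                break;
            case CHECKSUMS_OFF:
                assert(old_state == CHECKSUMS_INPROGRESS_OFF);
                break;
        }
    }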
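
For (4), the ordering I ended up with, sketched with made-up stand-in
functions (this is not the real postmaster/auxiliary-process code).
Registering for barriers *before* sampling the shared value means a
concurrent change is either already visible in the value we read, or gets
redelivered as a barrier we absorb later - it can be seen twice, but never
lost:

    #include <stdint.h>

    /* stand-ins for ProcSignalInit() and reading XLogCtl->data_checksum_version */
    extern void register_for_barriers(void);
    extern uint32 read_shared_checksum_version(void);

    /* per-process cached copy of the checksum state */
    static uint32 LocalDataChecksumVersion;

    static void
    aux_process_startup(void)
    {
        /*
         * 1. Join the procsignal machinery first, so any checksum-state change
         *    from here on is guaranteed to reach us as a barrier.
         */
        register_for_barriers();

        /*
         * 2. Only then sample the authoritative shared value. A barrier that
         *    raced with this read may tell us something we already know, which
         *    the absorb function has to tolerate for the very first update.
         */
        LocalDataChecksumVersion = read_shared_checksum_version();
    }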
I'm sure there's cleanup possible in a bunch of places, but the really
bad thing is I realized the handling on a standby is not quite correct.
I don't know what exactly is happening - there are too many moving parts -
but here's what I see ...
Every now and then, after restarting the standby, it logs a bunch of
page verification failures. Stuff like this:
WARNING: page verification failed, calculated checksum 9856 but expected 0
CONTEXT: WAL redo at 0/3447BA8 for Heap2/VISIBLE: snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel 1663/16384/16401, fork 2, blk 0 FPW; blkref #1: rel 1663/16384/16401, blk 0
WARNING: page verification failed, LSN 0/CF54C10
This is after an immediate shutdown, but I've seen similar failures for
fast shutdowns too (the root causes may be different / may need a
different fix, not sure).
The instance restarts, and the "startup" process starts recovery:
LOG: redo starts at 0/2000028
This matches the LSN from the very first start of the standby - there have
been no restart points since then, apparently. Since then the primary did
this with the checksums (per pg_waldump):
lsn: 0/0ECCFC48, prev 0/0ECCFBA0, desc: CHECKSUMS inprogress-off
lsn: 0/0ECD0168, prev 0/0ECD0128, desc: CHECKSUMS off
The instance already saw both records before the immediate shutdown (per
the logging in patch 0004), but after the restart the instance goes back
to having checksums enabled again:
data_checksum_version = 1
Which is correct, because it starts at 0/2000028, which is before either
of the XLOG_CHECKSUMS records. But then at 0/3447BA8 (which is *before*
either of the checksum changes) it tries to read a page from disk, and
hits a checksum error. That page is from the future (per the page LSN
logged by patch 0004), but it's still before both XLOG_CHECKSUMS
messages. So how come the page has pd_checksum 0?
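
For reference, this is essentially the comparison behind the WARNING above,
heavily simplified (the real check in PageIsVerifiedExtended also
special-cases new/all-zero pages; "page" and "blkno" here come from whatever
read the block):

    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /*
     * Recompute the checksum over the page and compare it with the value
     * stored in the page header.  A stored pd_checksum of 0 would normally
     * mean the page was last written while checksums were disabled - which
     * is what makes the failure above so odd, given where redo is.
     */
    static bool
    page_checksum_matches(Page page, BlockNumber blkno)
    {
        PageHeader  phdr = (PageHeader) page;
        uint16      calculated = pg_checksum_page((char *) page, blkno);

        /* e.g. "calculated checksum 9856 but expected 0" in the log above */
        return calculated == phdr->pd_checksum;
    }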
I'd have understood if the page came "broken" from the primary, but I've
not seen a single page verification failure on that side (and it's
subject to the same fast/immediate restarts, etc).
I wonder if this might be related to how we enforce checkpoints only
when setting the checksums to "on" on the primary. Maybe that's safe on
the primary but not on a standby?
FWIW I've seen similar issues for "fast" shutdowns too - at least the
symptoms are similar, but the mechanism might be a bit different. In
particular, I suspect there's some sort of thinko in updating the
data_checksum_version in the control file, but I can't put my finger on
it yet.
Another thing I noticed is this comment in CreateRestartPoint(), before
one of the early exits:
/*
 * If the last checkpoint record we've replayed is already our last
 * restartpoint, we can't perform a new restart point. We still update
 * minRecoveryPoint in that case, so that if this is a shutdown restart
 * point, we won't start up earlier than before. That's not strictly
 * necessary, but when hot standby is enabled, it would be rather weird
 * if the database opened up for read-only connections at a
 * point-in-time before the last shutdown. Such time travel is still
 * possible in case of immediate shutdown, though.
 * ...
I wonder if this "time travel backwards" might be an issue for this too,
because it might mean we end up picking the wrong data_checksum_version
from the control file. In any case, if this happens, we don't get to the
ControlFile->data_checksum_version update a bit further down. And
there's another condition that can skip that.
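To make the suspicion a bit more concrete, this is the control flow I'm
worried about, heavily simplified and with made-up condition/helper names
(not a proposed fix, just an illustration of "we don't get to the update"):

    #include <stdbool.h>

    /* made-up helpers standing in for the actual conditions/updates */
    extern bool last_checkpoint_is_already_our_restartpoint(void);
    extern bool another_skip_condition(void);
    extern void update_controlfile_data_checksum_version(void);

    static bool
    create_restart_point_sketch(void)
    {
        if (last_checkpoint_is_already_our_restartpoint())
        {
            /* early exit: minRecoveryPoint may be bumped here, but ... */
            return false;       /* ... data_checksum_version is never updated */
        }

        if (another_skip_condition())
            return false;       /* same problem on this path */

        /* only the successful restartpoint path reaches the update */
        update_controlfile_data_checksum_version();
        return true;
    }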
I'll continue investigating this next week, but at this point I'm quite
confused and would be grateful for any insights ...
regards
--
Tomas Vondra
Attachment | Content-Type | Size
---|---|---
v20250315-0001-Online-enabling-and-disabling-of-data-chec.patch | text/x-patch | 150.2 KB
v20250315-0002-Reviewfixups.patch | text/x-patch | 4.6 KB
v20250315-0003-reworks.patch | text/x-patch | 8.6 KB
v20250315-0004-debug-stuff.patch | text/x-patch | 17.3 KB
test.sh | application/x-shellscript | 8.3 KB