Re: Changing the state of data checksums in a running cluster

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Changing the state of data checksums in a running cluster
Date:	2025-03-14 12:20:37
Message-ID:	3372a09c-d1f6-4974-ad60-eec15ee0c734@vondra.me
Lists:	pgsql-hackers

On 3/14/25 00:11, Tomas Vondra wrote:
> ...
>>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>>> changes.
>>>>>
>>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>>
>>>> Not sure I follow. What's a reasonable constraint?
>>>
>>> That the postmaster doesn't process them.
>>>
>>
>> OK, that means we need a way to "refresh" the value for new child
>> processes, similar to what my patch does. But I suspect there might be
>> a race condition - if the child process starts while processing the
>> XLOG_CHECKSUMS record, it might happen to get the new value and then also
>> the barrier (if it does the "refresh" in between the XLogCtl update and
>> the barrier). Doesn't this need some sort of interlock, preventing this?
>>
>> The child startup would need to do this:
>>
>> 1) acquire lock
>> 2) reset barriers
>> 3) refresh the LocalDataChecksumValue (from XLogCtl)
>> 4) release lock
>>
>> while the walreceiver would do this
>>
>> 1) acquire lock
>> 2) update XLogCtl value
>> 3) emit barrier
>> 4) release lock
>>
>> Or is there a reason why this would be unnecessary?
>>
>
> I still think this might be a problem. I wonder if we could maybe
> leverage the barrier generation, to detect that we don't need to process
> this barrier, because we already got the value directly ...
>
> FWIW we'd have this problem even if postmaster was processing barriers,
> because there'd always be a "gap" between the fork and ProcSignalInit()
> registering the new process into the procsignal array.
>

I experimented with this a little bit, and unfortunately I ran into not
one, but two race conditions in this :-( I don't have reproducers, all
of this was done by manually adding sleep() calls / gdb breakpoints to
pause the processes for a while, but I'll try to explain what/why ...

1) race #1: SetDataChecksumsOn

The function (and all the other "SetDataChecksums" funcs) does this:

SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
SpinLockRelease(&XLogCtl->info_lck);

barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);

Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
process may start during that, and it'll read the current checksum value
from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
barrier. So far so good.

But the new backend is already registered in ProcSignal, so it'll get
the barrier too, and will try to set the local version to "on" again.
And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():

Assert(LocalDataChecksumVersion ==
       PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The other "SetDataChecksums" functions have the same issue, except that
in those cases there are no asserts to trip. Only
AbsorbChecksumsOnBarrier() has such an assert to check the state transition.

This is "ephemeral" in the sense that setting the value to "on" again
would be harmless, and indeed a non-assert build will run just fine.
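
To make the ordering concrete, here is a minimal single-threaded sketch that replays race #1 step by step. It is a toy model, not the patch code: ChecksumState, absorb_on_strict() and replay_race1() are hypothetical names, and the assert is modeled as a boolean return so the replay does not abort.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the checksum states; the numeric values are arbitrary. */
typedef enum
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS_ON,
    CHECKSUMS_ON
} ChecksumState;

/*
 * Hypothetical model of the "on" barrier absorb. Returns false in the
 * situation where the assert on the local state would fire.
 */
static bool
absorb_on_strict(ChecksumState *local)
{
    if (*local != CHECKSUMS_INPROGRESS_ON)
        return false;           /* unexpected state transition */
    *local = CHECKSUMS_ON;
    return true;
}

/*
 * Replay race #1 in a fixed order. Returns true if the absorb step saw
 * the state it expected.
 */
static bool
replay_race1(void)
{
    ChecksumState shared = CHECKSUMS_INPROGRESS_ON;
    ChecksumState local;

    /* 1. the setter updates the shared value, then pauses */
    shared = CHECKSUMS_ON;

    /* 2. a new backend starts, reads "on" and registers for barriers */
    local = shared;

    /* 3. the setter wakes up and emits the barrier; the backend absorbs it */
    return absorb_on_strict(&local);
}
```

In this model replay_race1() returns false, which corresponds to the assert failure described above, while a backend that was still in the "inprogress-on" state absorbs the barrier without complaint.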

2) race #2: InitPostgres

InitPostgres does this:

InitLocalControldata();

ProcSignalInit(MyCancelKeyValid, MyCancelKey);

where InitLocalControldata gets the current checksum value from XLogCtl,
and ProcSignalInit registers the backend into the procsignal (which is
what barriers are based on).

Imagine there's a sleep() between these two calls, and the cluster does
not have checksums enabled. A backend will start, will read "off" from
XLogCtl, and then gets stuck on the sleep before it gets added to the
procsignal/barrier array.

Now, we enable checksums, and the instance goes through 'inprogress-on'
and 'on' states. This completes, and the backend wakes up and registers
itself into procsignal - but it won't get any barriers, of course.

So we end up with an instance with data_checksums="on", but this one
backend still believes data_checksums="off". This can cause a lot of
trouble, because it won't write blocks with checksums. I.e. this is
persistent data corruption.
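
The same sequence as a minimal toy replay, assuming a single backend and treating barrier delivery as an immediate update of the local value. ToyCluster, toy_set_state() and replay_race2() are hypothetical names used only for illustration; nothing here is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the checksum states; the numeric values are arbitrary. */
typedef enum
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS_ON,
    CHECKSUMS_ON
} ChecksumState;

typedef struct
{
    ChecksumState shared;       /* stands in for the value kept in XLogCtl */
    ChecksumState local;        /* stands in for the backend's local copy */
    bool          registered;   /* true once the backend can receive barriers */
} ToyCluster;

/* Update the shared value, then deliver a barrier to a registered backend. */
static void
toy_set_state(ToyCluster *c, ChecksumState s)
{
    c->shared = s;
    if (c->registered)
        c->local = s;           /* barrier absorbed */
}

/* Replay race #2 in a fixed order and return the backend's final view. */
static ChecksumState
replay_race2(void)
{
    ToyCluster c = {CHECKSUMS_OFF, CHECKSUMS_OFF, false};

    /* 1. the backend reads the shared value ("off"), then pauses */
    c.local = c.shared;

    /* 2. checksums get enabled while the backend is not yet registered */
    toy_set_state(&c, CHECKSUMS_INPROGRESS_ON);
    toy_set_state(&c, CHECKSUMS_ON);

    /* 3. the backend wakes up and registers, but no barrier is pending */
    c.registered = true;

    return c.local;
}
```

In this model the shared state ends up as "on" while replay_race2() returns CHECKSUMS_OFF, i.e. the backend never learns about the change.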

I have been thinking about how to fix this. One way would be to
introduce some sort of locking, so that the two steps (update of the
XLogCtl version + barrier emit) and (local flag init + procsignal init)
would always happen atomically. So, something like this:

SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
SpinLockRelease(&XLogCtl->info_lck);

and

SpinLockAcquire(&XLogCtl->info_lck);
InitLocalControldata();
ProcSignalInit(MyCancelKeyValid, MyCancelKey);
SpinLockRelease(&XLogCtl->info_lck);

But that seems pretty heavy-handed. It's definitely much more work while
holding a spinlock than I'm comfortable with, and I wouldn't be
surprised if there were deadlock cases etc. (FWIW I believe it needs to
use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Anyway, I think a much simpler solution would be to reorder InitPostgres
like this:

ProcSignalInit(MyCancelKeyValid, MyCancelKey);

InitLocalControldata();

i.e. to first register into procsignal, and then read the new value.
AFAICS this guarantees we won't lose any checksum version updates. It
does mean we still can get a barrier for a value we've already seen, but
I think we should simply ignore this for the very first update.
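
A minimal sketch of why the reordering should be enough, assuming the absorb step simply ignores a barrier for a value the backend already has. It is a single-threaded toy model that replays four interleavings of the setter against the reordered startup; ToyBackend, Interleaving, toy_emit(), toy_absorb() and replay_reordered() are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the checksum states; the numeric values are arbitrary. */
typedef enum
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS_ON,
    CHECKSUMS_ON
} ChecksumState;

typedef struct
{
    ChecksumState local;            /* backend's local copy */
    bool          registered;       /* can receive barriers */
    bool          barrier_pending;  /* an "on" barrier awaits absorption */
} ToyBackend;

/* Where the setter runs relative to the backend's two startup steps. */
typedef enum
{
    SET_BEFORE_REGISTER,    /* update + emit before registration */
    SET_BETWEEN,            /* update + emit between register and read */
    SET_SPLIT_AROUND_READ,  /* update before the read, emit after it */
    SET_AFTER_READ          /* update + emit after the read */
} Interleaving;

/* Only registered backends see the barrier. */
static void
toy_emit(ToyBackend *b)
{
    if (b->registered)
        b->barrier_pending = true;
}

/* Absorb the "on" barrier, ignoring it if the value was already seen. */
static void
toy_absorb(ToyBackend *b)
{
    if (!b->barrier_pending)
        return;
    b->barrier_pending = false;
    if (b->local != CHECKSUMS_ON)
        b->local = CHECKSUMS_ON;
}

static ChecksumState
replay_reordered(Interleaving when)
{
    ChecksumState shared = CHECKSUMS_INPROGRESS_ON;
    ToyBackend    b = {CHECKSUMS_OFF, false, false};

    if (when == SET_BEFORE_REGISTER) { shared = CHECKSUMS_ON; toy_emit(&b); }

    b.registered = true;        /* step 1: register into procsignal */

    if (when == SET_BETWEEN) { shared = CHECKSUMS_ON; toy_emit(&b); }
    if (when == SET_SPLIT_AROUND_READ) shared = CHECKSUMS_ON;

    b.local = shared;           /* step 2: read the shared value */

    if (when == SET_SPLIT_AROUND_READ) toy_emit(&b);
    if (when == SET_AFTER_READ) { shared = CHECKSUMS_ON; toy_emit(&b); }

    toy_absorb(&b);             /* backend processes pending barriers */
    return b.local;
}
```

In this model every interleaving ends with the backend at CHECKSUMS_ON: either the read picks up the new value, or the barrier does, or both happen and the duplicate is ignored. It says nothing about memory ordering or real signal delivery, so it only illustrates the argument.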

Opinions? Other ideas how to fix this?

regards

--
Tomas Vondra

In response to

Re: Changing the state of data checksums in a running cluster at 2025-03-13 23:11:39 from Tomas Vondra

Responses

Re: Changing the state of data checksums in a running cluster at 2025-03-14 13:38:06 from Daniel Gustafsson
