Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-03-14 13:38:06
Message-ID: 3E9E1AF5-5344-45C5-9B38-A166CE57472E@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

> I experimented with this a little bit, and unfortunately I ran into not
> one, but two race conditions in this :-( I don't have reproducers, all
> of this was done by manually adding sleep() calls / gdb breakpoints to
> pause the processes for a while, but I'll try to explain what/why ...

Ugh. Thanks for this!

> 1) race #1: SetDataChecksumsOn
>
> The function (and all the other "SetDataChecksums" funcs) does this
>
> SpinLockAcquire(&XLogCtl->info_lck);
> XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
> SpinLockRelease(&XLogCtl->info_lck);
>
> barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
>
> Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
> process may start during that, and it'll read the current checksum value
> from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
> barrier. So far so good.
>
> But the new backend is already registered in ProcSignal, so it'll get
> the barrier too, and will try to set the local version to "on" again.
> And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():
>
> Assert(LocalDataChecksumVersion ==
> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>
> The other "SetDataChecksums" have the same issue, except that in those
> cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
> such assert to check the state transition.
>
> This is "ephemeral" in the sense that setting the value to "on" again
> would be harmless, and indeed a non-assert build will run just fine.

As mentioned off-list, being able to loosen the restriction for the first
barrier seen seem like a good way to keep this assertion. Removing it is of
course the alternative solution, as it's not causing any issues, but given how
handy it's been to find actual issues it would be good to be able to keep it.

> 2) race #2: InitPostgres
>
> The InitPostgres does this:
>
> InitLocalControldata();
>
> ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>
> where InitLocalControldata gets the current checksum value from XLogCtl,
> and ProcSignalInit registers the backend into the procsignal (which is
> what barriers are based on).
>
> Imagine there's a sleep() between these two calls, and the cluster does
> not have checksums enabled. A backend will start, will read "off" from
> XLogCtl, and then gets stuck on the sleep before it gets added to the
> procsignal/barrier array.
>
> Now, we enable checksums, and the instance goes through 'inprogress-on'
> and 'on' states. This completes, and the backend wakes up and registers
> itself into procsignal - but it won't get any barriers, of course.
>
> So we end up with an instance with data_checksums="on", but this one
> backend still believes data_checksums="on". This can cause a lot of
> trouble, because it won't write blocks with checksums. I.e. this is
> persistent data corruption.
>
> I have been thinking about how to fix this. One way would be to
> introduce some sort of locking, so that the two steps (update of the
> XLogCtl version + barrier emit) and (local flag init + procsignal init)
> would always happen atomically. So, something like this:
>
> SpinLockAcquire(&XLogCtl->info_lck);
> XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
> barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> SpinLockRelease(&XLogCtl->info_lck);
>
> and
>
> SpinLockAcquire(&XLogCtl->info_lck);
> InitLocalControldata();
> ProcSignalInit(MyCancelKeyValid, MyCancelKey);
> SpinLockRelease(&XLogCtl->info_lck);
>
> But that seems pretty heavy-handed, it's definitely much more work while
> holding a spinlock than I'm comfortable with, and I wouldn't be
> surprised if there were deadlock cases etc. (FWIW I believe it needs to
> use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Yeah, that seems quite likely to introduce a new set if issues.

> Anyway, I think a much simpler solution would be to reorder InitPostgres
> like this:
>
> ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>
> InitLocalControldata();

Agreed.

> i.e. to first register into procsignal, and then read the new value.
> AFAICS this guarantees we won't lose any checksum version updates. It
> does mean we still can get a barrier for a value we've already seen, but
> I think we should simply ignore this for the very first update.

Calling functions with sideeffects in setting state seems like a bad idea
before ProcSignalInit has run, that's thinko on my part in this patch. Your
solution of reordering seems like the right way to handle this.

--
Daniel Gustafsson

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Evgeny 2025-03-14 13:43:31 Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Previous Message Nazir Bilal Yavuz 2025-03-14 13:25:00 Re: Add pg_buffercache_evict_all() and pg_buffercache_mark_dirty[_all]() functions