Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2024-10-09 10:41:02
Message-ID: 2bf72a05-a2ce-4470-ba5a-adff3f087b55@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/8/24 22:38, Daniel Gustafsson wrote:
>
>> 7) controldata.c - maybe this
>>
>> if (oldctrl->data_checksum_version == 2)
>>
>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
>> constant? For "off" we use "0" which seems somewhat acceptable, but for
>> other values it's less obvious what the meaning is.
>
> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
> make them available to frontend and backend tools?
>

+1 to have checksum_impl.h

>
>> But there also seems to be a more serious problem with recovery. I did a
>> simple script that does a loop of
>>
>> * start a cluster
>> * initialize a small pgbench database (scale 1 - 100)
>> * run "long" pgbench
>> * call pg_enable_data_checksums(), wait for it to complete
>> * stop the cluster with "-m immediate"
>> * start the cluster
>>
>> And this unfortunately hits this assert:
>>
>> bool
>> AbsorbChecksumsOnBarrier(void)
>> {
>> Assert(LocalDataChecksumVersion ==
>> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
>> return true;
>> }
>>
>> Based on our short discussion about this, the controlfile gets updated
>> right after pg_enable_data_checksums() completes. The immediate stop
>> however forces a recovery since the last checkpoint, which means we see
>> the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
>> exit recovery, try to start checkpointer and it trips over this, because
>> the control file already has the "on" value :-(
>>
>> I'm not sure what's the best way to fix this. Would it be possible to
>> remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
>> noop in that case? Or not set the barrier when exiting recovery. I'm not
>> sure the relaxed assert would remain meaningful, though. What would it
>> check on standbys, for example?
>>
>> Maybe a better way would be to wait for a checkpoint before updating the
>> controlfile, similar to what we do at the beginning? Possibly even with
>> the same "fast=true/false" logic. That would prevent us from seeing the
>> XLOG_CHECKSUMS wal record with the updated flag. It would extend the
>> "window" where a crash would mean we have to redo the checksums, but I
>> don't think that matters much. For small databases who cares, and for
>> large databases it should not be a meaningful difference (setting the
>> checksums already ran over multiple checkpoints, so one checkpoint is
>> not a big difference).
>
> The more I think about it the more I think that updating the control file is
> the wrong thing to do for this patch, it should only change the state in memory
> and let the checkpoints update the controlfile. The attached fixes that and I
> can no longer reproduce the assertion failure you hit.
>

I think leaving the update of controlfile to checkpointer is correct,
and probably the only way to make this correct (without race
conditions). We need to do that automatically with the checkpoint (which
updates the redo LSN, guaranteeing we won't see the XLOG_CHECKSUMS
record again).

I ran the tests with this new patch, and I haven't reproduced the
crashes. I'll let it run a bit longer, and improve it to test some more
stuff, but it looks good.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-10-09 10:59:20 Re: overflow bug for inhcounts
Previous Message Michael Paquier 2024-10-09 09:54:18 Re: GUC names in messages