Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2024-11-08 00:41:10
Message-ID: e4dbcb2c-e04a-4ba2-bff0-8d979f55960e@vondra.me
Lists: pgsql-hackers

Hi,

Unfortunately it seems we're not out of the woods yet :-(

I started doing some more testing on the v8 patch. My plan was to do
some stress testing with physical replication, random restarts and stuff
like that. But I ran into issues before that.

Attached is a reproducer script that does this:

1) initializes an instance with a small (scale 10) pgbench database

2) runs pgbench in the background, and flips checksums

3) restarts the database with fast or immediate mode

4) watches the checksum state until it reaches the expected value

5) restarts the instance
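
For readers without the attachment, step 4 can be sketched roughly as below. This is not the attached test.sh, only a minimal sketch under some assumptions: psql can connect to a database named postgres with default settings, SHOW data_checksums reports the current state as a plain string, and the function names, STATE_CMD and POLL_INTERVAL are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the attached test.sh.
# Prints the current checksum state; assumes psql can connect with
# default settings and that SHOW data_checksums reports the state.
get_checksum_state() {
    psql -X -A -t -c "SHOW data_checksums" postgres
}

# Poll until the reported state equals $1, or give up after $2 attempts
# (default 60). STATE_CMD may name a different command or function to
# read the state from; POLL_INTERVAL is the sleep between attempts.
wait_for_checksum_state() {
    expected=$1
    attempts=${2:-60}
    state=""
    i=0
    while [ "$i" -lt "$attempts" ]; do
        state=$(${STATE_CMD:-get_checksum_state})
        if [ "$state" = "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    echo "timed out waiting for '$expected', last state '$state'" >&2
    return 1
}
```

A run along the lines of steps 3 and 4 might then call pg_ctl -D "$PGDATA" -m immediate restart followed by wait_for_checksum_state on 300, with "on" standing in for whatever the expected value is. The timeout matters here, because a wait that never ends is exactly the symptom described below.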

Of course, the restart interrupts the checksum enable, with this message
in the log:

WARNING: data checksums are being enabled, but no worker is running
1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:]
[672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT: If checksums
were being enabled during shutdown then processing must be manually
restarted.

That's expected, of course. So I did

SELECT pg_enable_data_checksums()

and "datachecksumsworker launcher" appeared in pg_stat_activity, but
nothing else was happening. It also says:

Waiting for worker in database template0 (pid 258442)

But there are no workers with that PID. Not in the OS, not in the view,
not in the server log. Seems a bit weird. Maybe it already completed,
but then why is there a launcher waiting for it?
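
The check for the missing worker can be repeated with something like the following. Again this is only a sketch with hypothetical function names: it assumes psql connects to a database named postgres with default settings and that the script runs as the same OS user as the server, since kill -0 fails for processes owned by someone else.

```shell
#!/bin/sh
# Hypothetical helpers for checking whether a reported worker PID exists.

# Returns 0 if a process with PID $1 exists. Signal 0 only probes and
# delivers nothing; it also fails for a process owned by another user,
# so run this as the server's OS user.
pid_exists_in_os() {
    kill -0 "$1" 2>/dev/null
}

# Prints the pg_stat_activity row for PID $1, or nothing if the view
# has no backend with that PID.
pid_in_stat_activity() {
    psql -X -A -t -c \
        "SELECT pid, backend_type, datname FROM pg_stat_activity WHERE pid = $1" \
        postgres
}
```

For the case above, both pid_exists_in_os 258442 and pid_in_stat_activity 258442 would be expected to come back empty-handed, matching what I saw.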

Ultimately I tried running CHECKPOINT, and that apparently did the
trick, and the instance restarted. But then on start it hits this assert:

(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)

But this only happens if the final stop is "-m immediate". If I change it
to "-m fast" it works.

I haven't looked into the details, but I guess it's related to the issue
with controlfile update we dealt with about a month ago.

Attached is the test.sh file (make sure to tweak the paths), and an
example of the backtraces. I've seen various processes hitting that.

Two more comments:

* It's a bit surprising that pg_disable_data_checksums() flips the state
right away, while pg_enable_data_checksums() waits for a checkpoint. I
guess it's correct, but maybe the docs should mention this difference?

* The docs currently say:

<para>
If the cluster is stopped while in <literal>inprogress-on</literal> mode,
for any reason, then this process must be restarted manually. To do this,
re-execute the function <function>pg_enable_data_checksums()</function>
once the cluster has been restarted. The background worker will attempt
to resume the work from where it was interrupted.
</para>

I believe that's incorrect/misleading. There's no attempt to resume work
from where it was interrupted.

regards

--
Tomas Vondra

Attachment Content-Type Size
test.sh application/x-shellscript 1.9 KB
backtraces.txt text/plain 4.2 KB
