Quick Links

Re: Online enabling of checksums

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Magnus Hagander <magnus(at)hagander(dot)net>
Cc:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Online enabling of checksums
Date:	2018-04-07 00:59:28
Message-ID:	20180407005928.mwhgxzxjxqpa4dho@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Here's a pass through the patch:
@@ -1033,7 +1034,7 @@ XLogInsertRecord(XLogRecData *rdata,
Assert(RedoRecPtr < Insert->RedoRecPtr);
RedoRecPtr = Insert->RedoRecPtr;
}
- doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
+ doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsInProgress());

if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)

Why does this need an in-progress specific addition? Given that we
unconditionally log FPWs for all pages, and

#define XLogHintBitIsNeeded() (DataChecksumsNeedWrite() || wal_log_hints)

it's not clear what this achieves? At the very least needs a comment.

@@ -4748,12 +4745,90 @@ GetMockAuthenticationNonce(void)
* Are checksums enabled for data pages?
*/
bool
-DataChecksumsEnabled(void)
+DataChecksumsNeedWrite(void)
{
Assert(ControlFile != NULL);
return (ControlFile->data_checksum_version > 0);
}

+bool
+DataChecksumsNeedVerify(void)
+{
+ Assert(ControlFile != NULL);
+
+ /*
+ * Only verify checksums if they are fully enabled in the cluster. In
+ * inprogress state they are only updated, not verified.
+ */
+ return (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION);
+}
+
+bool
+DataChecksumsInProgress(void)
+{
+ Assert(ControlFile != NULL);
+ return (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION);
+}

As previously mentioned, the locking model around this is unclear. It's
probably fine due to to surrounding memory barriers, but that needs to
be very very explicitly documented.

+void
+SetDataChecksumsOn(void)
+{
+ Assert(ControlFile != NULL);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+ if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+ {
+ LWLockRelease(ControlFileLock);
+ elog(ERROR, "Checksums not in inprogress mode");
+ }
+
+ ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ XlogChecksums(PG_DATA_CHECKSUM_VERSION);

As I've explained in
https://www.postgresql.org/message-id/20180406235126.d4sg4dtgicdpucnj@alap3.anarazel.de
this appears to be unsafe. There's no guarantee that the
ControlFile->data_checksum_version hasn't intermittently set to 0.

@@ -7788,6 +7863,16 @@ StartupXLOG(void)
*/
CompleteCommitTsInitialization();

+ /*
+ * If we reach this point with checksums in inprogress state, we notify
+ * the user that they need to manually restart the process to enable
+ * checksums.
+ */
+ if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+ ereport(WARNING,
+ (errmsg("checksum state is \"inprogress\" with no worker"),
+ errhint("Either disable or enable checksums by calling the pg_disable_data_checksums() or pg_enable_data_checksums() functions.")));
+

Hm, so one manually has to take action here. Any reason we can't just
re-start the worker? Also, this'll be issued on standbys etc too, that
seems misleading?

+
+/*
+ * Disables checksums for the cluster, unless already disabled.
+ *
+ * Has immediate effect - the checksums are set to off right away.
+ */
+Datum
+disable_data_checksums(PG_FUNCTION_ARGS)
+{
+ /*
+ * If we don't need to write new checksums, then clearly they are already
+ * disabled.
+ */
+ if (!DataChecksumsNeedWrite())
+ ereport(ERROR,
+ (errmsg("data checksums already disabled")));
+
+ ShutdownChecksumHelperIfRunning();
+
+ SetDataChecksumsOff();
+
+ PG_RETURN_VOID();

Unsafe, see SetDataChecksumsOn comment above.

Shouldn't this be named with a pg_? We normally do that for SQL callable
functions, no? See the preceding functions.

This function is marked PROPARALLEL_SAFE. That can't be right?

Also, shouldn't this refuse to work if called in recovery mode? Erroring
out with
ERROR: XX000: cannot make new WAL entries during recovery
doesn't seem right.

+/*
+ * Enables checksums for the cluster, unless already enabled.
+ *
+ * Supports vacuum-like cost-based throttling, to limit system load.
+ * Starts a background worker that updates checksums on existing data.
+ */
+Datum
+enable_data_checksums(PG_FUNCTION_ARGS)

This is PROPARALLEL_RESTRICTED. That doesn't strike me right, shouldn't
they be PROPARALLEL_UNSAFE? It might be fine, but I'd not want to rely
on it.

+/*
+ * Main entry point for checksumhelper launcher process.
+ */
+bool
+StartChecksumHelperLauncher(int cost_delay, int cost_limit)

entry point sounds a bit like it's the bgw invoked routine...

+/*
+ * ShutdownChecksumHelperIfRunning
+ * Request shutdown of the checksumhelper
+ *
+ * This does not turn off processing immediately, it signals the checksum
+ * process to end when done with the current block.
+ */
+void
+ShutdownChecksumHelperIfRunning(void)
+{
+ /* If the launcher isn't started, there is nothing to shut down */
+ if (pg_atomic_unlocked_test_flag(&ChecksumHelperShmem->launcher_started))
+ return;

Using an unlocked op here without docs seems wrong. See also mail
referenced above.

+static bool
+ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
...
+ BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);

Shouldn't this have a comment explaining that this is safe because all
pages that concurrently get added to the end are going to be checksummed
by the extendor?

+ /* Need to get an exclusive lock before we can flag as dirty */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

Hm. So we'll need exclusive locks on all pages in the database. On busy
inner tables that's going to be painful. It shouldn't be too hard to
reduce this to share locks.

+ START_CRIT_SECTION();
+ MarkBufferDirty(buf);
+ log_newpage_buffer(buf, false);

Hm. So we always log buffers as non-standard ones? That's going to cause
quite the increase in FPW space.

+ elog(DEBUG2, "Checksumhelper done with relation %d: %s",
+ relationId, (aborted ? "aborted" : "finished"));

Hm. Wouldn't it be advisable to include actual relation names? It's
pretty annoying to keep track this way.

+void
+ChecksumHelperLauncherMain(Datum arg)
....

+ /*
+ * Create a database list. We don't need to concern ourselves with
+ * rebuilding this list during runtime since any database created after
+ * this process started will be running with checksums turned on from the
+ * start.
+ */

Why is this true? What if somebody runs CREATE DATABASE while the
launcher / worker are processing a different database? It'll copy the
template database on the filesystem level, and it very well might not
yet have checksums set? Afaict the second time we go through this list
that's not cought.

+ if (processing == SUCCESSFUL)
+ {
+ pfree(db->dbname);
+ pfree(db);
...

Why bother with that and other deallocations here and in other places?
Launcher's going to quit anyway...

+ else if (processing == FAILED)
+ {
+ /*
+ * Put failed databases on the remaining list.
+ */
+ remaining = lappend(remaining, db);

Uh. So we continue checksumming the entire cluster if one database
failed? Given there's no restart capability at all, that seems hard to
defend?

+ CurrentDatabases = BuildDatabaseList();
+
+ foreach(lc, remaining)
+ {
+ ChecksumHelperDatabase *db = (ChecksumHelperDatabase *) lfirst(lc);
+ bool found = false;
+
+ foreach(lc2, CurrentDatabases)
+ {
+ ChecksumHelperDatabase *db2 = (ChecksumHelperDatabase *) lfirst(lc2);
+
+ if (db->dboid == db2->dboid)
+ {
+ found = true;

This is an O(N^2) comparison logic? There's clusters with tens of
thousands of databases... The comparison costs are low, but still.

+ /*
+ * Foreign tables have by definition no local storage that can be
+ * checksummed, so skip.
+ */
+ if (pgc->relkind == RELKIND_FOREIGN_TABLE)
+ continue;
+

That strikes me as a dangerous form of test. Shouldn't we instead check
whether a relfilenode exists? I'll note that this test currently
includes plain views. It's just the smgrexist() test that makes it
work.

- Andres

In response to

Re: Online enabling of checksums at 2018-04-05 20:12:08 from Andres Freund

Responses

Re: Online enabling of checksums at 2018-04-07 04:26:38 from Andres Freund
Re: Online enabling of checksums at 2018-04-10 16:18:46 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Chapman Flack	2018-04-07 01:10:59	Re: [PATCH] Update README for Resource Owners
Previous Message	David Rowley	2018-04-07 00:43:29	Re: [HACKERS] path toward faster partition pruning