From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Julien Rouhaud <rjuju123(at)gmail(dot)com> |
Subject: | pg_stat_database.checksum_failures vs shared relations |
Date: | 2025-03-27 15:58:27 |
Message-ID: | mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
First - I find it rather shocking that we have absolutely *zero* tests of
checksum failures in the backend. Zero. As evidenced by [1]. I really can't
quite believe it. Nor do we have tests of ignore_checksum_failure or
zero_damaged_pages.
I was trying to write some tests for checksums vs AIO, and one thing I noticed
is that our docs seem to rather strongly hint that checksum failures on shared
objects would be reported to the dbname IS NULL row in pg_stat_database:
The <structname>pg_stat_database</structname> view will contain one row
for each database in the cluster, plus one for shared objects, showing
database-wide statistics.
Name of this database, or <literal>NULL</literal> for shared
objects.
Number of data page checksum failures detected in this
database (or on a shared object), or NULL if data checksums are
disabled.
But that is not the case, they always get reported to MyDatabaseId by
PageIsVerifiedExtended().
The docs hint more clearly at checksum failures of shared rels reported this
way starting with:
commit 252b707bc41
Author: Magnus Hagander <magnus(at)hagander(dot)net>
Date: 2019-04-17 13:51:48 +0200
Return NULL for checksum failures if checksums are not enabled
which made changes like:
<entry>Number of data page checksum failures detected in this
- database</entry>
+ database (or on a shared object), or NULL if data checksums are not
+ enabled.</entry>
</row>
The stats tracking of checksum failures was introduced in
Author: Magnus Hagander <magnus(at)hagander(dot)net>
Date: 2019-03-09 10:45:17 -0800
Track block level checksum failures in pg_stat_database
which already did report stats on shared objects that way:
@@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno)
errmsg("page verification failed, calculated checksum %u but expected %u",
checksum, p->pd_checksum)));
+ pgstat_report_checksum_failure();
+
if (header_sane && ignore_checksum_failure)
return true;
}
In basebackup.c however, it didn't track stats on shared relations at that time:
@@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
ereport(WARNING,
(errmsg("file \"%s\" has a total of %d checksum verification "
"failures", readfilename, checksum_failures)));
+
+ if (dboid != InvalidOid)
+ pgstat_report_checksum_failures_in_db(dboid, checksum_failures);
}
total_checksum_failures += checksum_failures;
that was changed in:
commit 77bd49adba4
Author: Magnus Hagander <magnus(at)hagander(dot)net>
Date: 2019-04-12 14:04:50 +0200
Show shared object statistics in pg_stat_database
This adds a row to the pg_stat_database view with datoid 0 and datname
NULL for those objects that are not in a database. This was added
particularly for checksums, but we were already tracking more satistics
for these objects, just not returning it.
...
which made the call to pgstat_report_checksum_failures_in_db()
in basebackup.c unconditional. It did leave this comment though:
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
I think the above commit makes it pretty clear that the intent is for checksum
errors on shared database entries to be reported to the "shared" entry in
pg_stat_database.
So, today we have the weird situation that *some* checksum errors on shared
relations get attributed to the current database (if they happen in a backend
normally accessing a shared relation), whereas others get reported to the
"shared relations" "database" (if they happen during a base backup). That
seems ... not optimal.
One question is whether we consider this a bug that should be backpatched.
To fix it we'd need to provide a bit more information to
PageIsVerifiedExtended(), it currently doesn't know what database the page it
is verifying is in and therefore can't report an error with
pgstat_report_checksum_failures_in_db() (rather than
pgstat_report_checksum_failure(), which attributes to MyDatabaseId).
Obviously having to change the signature of PageIsVerifiedExtended() makes it
harder to fix in the backbranches.
Greetings,
Andres Freund
[1] https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136
From | Date | Subject | |
---|---|---|---|
Next Message | Renan Alves Fonseca | 2025-03-27 16:02:34 | Remove restrictions in recursive query |
Previous Message | Robert Haas | 2025-03-27 15:48:26 | Re: libpq maligning postgres stability |