Re: pg_stat_database.checksum_failures vs shared relations

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net>, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re: pg_stat_database.checksum_failures vs shared relations
Date: 2025-03-28 01:02:02
Message-ID: 3j6jxcigfavufwkfxbzyauj4fibfwhl37uqh3ttjko26meqxft@p7jelszxn6nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-28 09:44:58 +0900, Michael Paquier wrote:
> On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > So, today we have the weird situation that *some* checksum errors on shared
> > > relations get attributed to the current database (if they happen in a backend
> > > normally accessing a shared relation), whereas others get reported to the
> > > "shared relations" "database" (if they happen during a base backup). That
> > > seems ... not optimal.
> > >
> > > One question is whether we consider this a bug that should be backpatched.
> >
> > I think it would be defensible if pg_basebackup reported all errors
> > with OID 0 and backend connections reported all errors with OID
> > MyDatabaseId, but it seems hard to justify having pg_basebackup take
> > care to report things using the correct database OID and individual
> > backend connections not take care to do the same thing. So I think
> > this is a bug. If fixing it in the back-branches is too annoying, I
> > think it would be reasonable to fix it only in master, but
> > back-patching seems OK too.
>
> Being able to get a better reporting for shared relations in back
> branches would be nice, but that's going to require some invasive
> chirurgy, isn't it?

Yea, that's what I was worried about too. I think we basically would need a
PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(),
with optional arguments that the "fixed" callers would use.

> We don't know currently the OID of the relation whose block is
> corrupted with only PageIsVerifiedExtended().

I don't think the relation oid is really the relevant bit, it's the database
oid (or alternatively tablespace). But PageIsVerifiedExtended() doesn't know
that either, obviously.

> There are two callers of PIV_REPORT_STAT on HEAD:

> - The checksum reports from RelationCopyStorage() know the
> SMgrRelation.
> - ReadBuffersOperation() has an optional Relation and a
> SMgrRelationData.

An SMgrRelationData suffices, via ->smgr_rlocator.locator.dbOid.

FWIW, it turns out that there are more cases than just MyDatabaseId and
InvalidOid - ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer()
read buffers in a different database than MyDatabaseId.

> We could just refactor PageIsVerifiedExtended() so as it reports a
> state about why the verification failed and let the callers report the
> checksum failure with a relation OID, splitting the data for shared
> and non-shared relations?

Yea, I think we basically need a *checksum_failed out argument, and then the
callers need to do

if (checksum_failure)
pgstat_report_checksum_failures_in_db(src->smgr_rlocator.locator.dbOid, 1);

Or alternatively we can optionally pass in the rlocator to
PageIsVerifiedExtended2(), so it can do the above internally.

Btw, it seems somewhat odd that we accumulate stats for checksum failures but
not for invalid page headers - the latter seems even worse...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-03-28 01:22:19 Re: Cross-type index comparison support in contrib/btree_gin
Previous Message Michael Paquier 2025-03-28 00:44:58 Re: pg_stat_database.checksum_failures vs shared relations