Re: Checksum errors in pg_stat_database

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksum errors in pg_stat_database
Date: 2019-03-30 13:33:45
Message-ID: CABUevEzD_duH_hGyZw14o+khHBw-rWSSAxbEKt5HWy2cK0Djdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 13, 2019 at 4:54 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:

> On Wed, Mar 13, 2019 at 4:53 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> >
> > On Sun, Mar 10, 2019 at 1:13 PM Julien Rouhaud <rjuju123(at)gmail(dot)com>
> wrote:
> > >
> > > On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud <rjuju123(at)gmail(dot)com>
> wrote:
> > > >
> > > > On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <magnus(at)hagander(dot)net>
> wrote:
> > > > >
> > > > > On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <rjuju123(at)gmail(dot)com>
> wrote:
> > > > >>
> > > > >> Sorry, I have again new comments after a little bit more thinking.
> > > > >> I'm wondering if we can do something about shared objects while
> we're
> > > > >> at it. They don't belong to any database, so it's a little bit
> > > > >> orthogonal to this proposal, but it seems quite important to track
> > > > >> error on those too!
> > > > >>
> > > > >> What about adding a new field in PgStat_GlobalStats for that? We
> can
> > > > >> use the same lastDir to easily detect such objects and slightly
> adapt
> > > > >> sendFile again, which seems quite straightforward.
> > > >
> > > > > Question is then what number that should show -- only the checksum
> counter in non-database-fields, or the total number across the cluster?
> > > >
> > > > I'd say only for non-database-fields errors, especially if we can
> > > > reset each counters separately. If necessary, we can add a new view
> > > > to give a global overview of checksum errors for DBA convenience.
> > >
> > > I'm considering adding a new PgStat_ChecksumStats for that purpose
> > > instead, but I don't know if that's acceptable to do so in the last
> > > commitfest. It seems worthwhile to add it eventually, since we'll
> > > probably end up having more things to report to users related to
> > > checksum. Online enabling of checksum could be the most immediate
> > > potential target.
> >
> > I wasn't aware that we were already storing informations about shared
> > objects in PgStat_StatDBEntry, with an InvalidOid as databaseid
> > (though we don't have any system view that are actually showing
> > information for such objects).
> >
> > As a result I ended up simply adding counters for the number of total
> > checks and the timestamp of the last failure in PgStat_StatDBEntry,
> > making attached patch very lightweight. I moved all the checksum
> > related counters out of pg_stat_database in a new pg_stat_checksum
> > view. It avoids to make pg_stat_database too wide, and also allows to
> > display information about shared object in this new view (some of the
> > other counters don't really make sense for shared objects or could
> > break existing monitoring query). While at it, I tried to add a
> > little bit of documentation wrt. checksum monitoring.
>
> and of course I forgot to attach the patch.
>

Does it really make any sense to track "number of checksum checks"? In any
sort of interesting database that's just going to be an insanely high
number, isn't it? (And also, to stay consistent with checksum failures, we
should of course also count the checks done in base backups, which is not
in the patch. But I'm more thinking we should drop it)

I do like the addition of the "last failure" column, that's really useful.

Having thought some more about this, I wonder if the right thing to do is
to actually add a row to pg_stat_database for the global stats, rather than
invent a separate view. I can see the argument going both ways, but
particularly with the name pg_stat_checksums we are setting a pattern that
will create one view for each counter. That's not very good, I think.

In the end I'm somewhat split on the idea of pg_stat_database with a NULL
row or pg_stat_checkpoints. What do others think?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-03-30 13:35:29 Re: Online verification of checksums
Previous Message Michael Paquier 2019-03-30 13:12:33 Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru