Re: Checksums by default?

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums by default?
Date: 2017-01-25 21:22:41
Message-ID: CAM3SWZQuySEU6VaTgZV8sDmu4ZLvnFA6_anR8VwGsuzC+7y91w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 25, 2017 at 12:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Also, I think that one of the big problems with the way checksums work
> is that you don't find problems with your archived data until it's too
> late. Suppose that in February bits get flipped in a block. You
> don't access the data until July[1]. Well, it's nice to have the
> system tell you that the data is corrupted, but what are you going to
> do about it? By that point, all of your backups are probably
> corrupted. So it's basically:
>
> ERROR: you're screwed
>
> It's nice to know that (maybe?) but without a recovery strategy a
> whole lot of people who get that message are going to immediately
> start asking "How do I ignore the fact that I'm screwed and try to
> read the data anyway?".

That's also how I tend to think about it.

I understand that my experience with storage devices is unusually
narrow compared to everyone else here. That's why I remain neutral on
the high-level question of whether or not we ought to enable checksums
by default. I'll ask other hackers to answer what may seem like a very
naive question, while bearing what I just said in mind. The question
is: Have you ever actually seen a checksum failure in production? And,
if so, how helpful was it?

I myself have not, despite the fact that Heroku uses checksums
wherever possible, and has the technical means to detect problems like
this across the entire fleet of customer databases. Not even once.
This is not what I would have expected myself several years ago.

--
Peter Geoghegan
