From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Checksums, state of play |
Date: | 2012-03-06 17:50:24 |
Message-ID: | 20120306175024.GA1347@momjian.us |
Lists: | pgsql-hackers |
On Tue, Mar 06, 2012 at 09:25:17AM -0500, Robert Haas wrote:
> > 2. Turning checksums on/off/on/off in rapid succession can cause false
> > positive reports of checksum failure if crashes occur and are ignored.
> > That may lead to the feature and PostgreSQL being held in disrepute.
>
> This I do think is a problem, although not for precisely the reason
> stated here. In my experience, in data corruption situations, the
> first thing customers do is blame PostgreSQL: they don't believe it's
> the hardware; they accuse us of having bugs in our code. Having a
> checksum feature would be valuable, because, first, we'd perhaps
> detect problems sooner and, second, people understand what checksums
> are and that checksum failures really shouldn't happen unless the
> hardware is bad. More generally, one of the purposes of checksums is
> to distinguish hardware failure from other possible causes of data
> corruption problems. If there are code paths where checksum failures
> can happen despite the hardware being good, I think that the patch will
> fail to accomplish its goal of giving us confidence that the hardware
> is bad.
I think the "turning checksums on/off/on/off" is really a killer
problem, and obviously many of the actions needed to make it safe make
the checksum feature itself less useful.
One crazy idea would be to have a checksum _version_ number somewhere on
the page and in pg_controldata. When you turn on checksums, you
increment that value, and every page checksummed from then on gets that
checksum version; if you turn off checksums, we just don't check them
anymore, but the stored checksums might become incorrect due to a hint
bit write and a crash. When you turn on checksums again, you increment
the checksum version again, and only check pages having the _new_
checksum version.
Yes, this does add additional storage requirements for the checksum, but
I don't see another clean option. If you can spare one byte, that lets
you turn on checksums 255 times; after that, you have to dump/reload to
use the checksum feature.
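
To make the version idea concrete, here is a minimal Python sketch, not actual PostgreSQL code. The `Cluster` and `Page` classes, their method names, the use of CRC-32 via `zlib.crc32` as the checksum, and reserving version 0 for "never checksummed" are all assumptions made for illustration. The only parts taken from the proposal are the version counter stored both on the page and cluster-wide, the increment on each enable, the rule of checking only pages carrying the current version, and the one-byte limit.

```python
import zlib
from dataclasses import dataclass

MAX_VERSION = 255  # one byte; 0 is reserved here for "never checksummed" (assumption)


@dataclass
class Page:
    data: bytes
    checksum: int = 0
    checksum_version: int = 0  # version under which the checksum was written


@dataclass
class Cluster:
    checksum_version: int = 0      # cluster-wide counter (the control-data value)
    checksums_enabled: bool = False

    def enable_checksums(self) -> None:
        # Each enable bumps the version, so older checksums are never trusted.
        if self.checksum_version >= MAX_VERSION:
            raise RuntimeError("checksum versions exhausted; dump/reload required")
        self.checksum_version += 1
        self.checksums_enabled = True

    def disable_checksums(self) -> None:
        self.checksums_enabled = False

    def write_page(self, page: Page, data: bytes) -> None:
        page.data = data
        if self.checksums_enabled:
            page.checksum = zlib.crc32(data)  # stand-in checksum algorithm
            page.checksum_version = self.checksum_version
        # While disabled, the old checksum and version stay on the page and go stale.

    def verify_page(self, page: Page):
        """Return True or False for a checked page, None when the check is skipped."""
        if not self.checksums_enabled:
            return None
        if page.checksum_version != self.checksum_version:
            return None  # written under an older version (or never): do not check
        return zlib.crc32(page.data) == page.checksum


# Usage: an on/off/on cycle does not produce a false positive.
cluster = Cluster()
cluster.enable_checksums()                            # version 1
page = Page(b"")
cluster.write_page(page, b"tuple data")
cluster.disable_checksums()
cluster.write_page(page, b"tuple data + hint bit")    # checksum is now stale
cluster.enable_checksums()                            # version 2
assert cluster.verify_page(page) is None              # stale page is skipped
cluster.write_page(page, b"tuple data + hint bit")    # rewritten under version 2
assert cluster.verify_page(page) is True
```

In this sketch a page left over from an earlier "on" period is simply not checked until it is rewritten under the current version, which is the trade-off the proposal accepts in exchange for avoiding false checksum failures.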
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +