Simon Riggs wrote:
> So overall, I do now think its still possible to add an optional
> checksum in the 9.2 release and am willing to pursue it unless
> there are technical objections.
Just to restate Simon's proposal, to make sure I understand it: 9.2
would support both the old page header format number and a new one,
with the two headers the same size and carefully engineered to
minimize how much code needs to be aware of the version.
PageHeaderIsValid() and PageInit() certainly would, and we would need
some way to set, clear (maybe), and validate a CRC. We would need a
GUC to indicate whether to write the CRC; if a CRC is present we
would always test it on read and treat the page as damaged if it
didn't match. (Perhaps
other options could be added later, to support recovery attempts, but
let's not complicate a first cut.) This whole idea would depend on
either (1) trusting your storage system never to tear a page on write
or (2) getting the double-write feature added, too.
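To make the restatement concrete, here is a toy model in Python, not PostgreSQL code. The header layout, the two format numbers, and the write_page_crc flag are all assumptions made up for illustration; only the function names echo PageInit() and PageHeaderIsValid(). It shows the two formats sharing one header size, with the CRC checked on read whenever the format says one is present.

```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical toy header: 2-byte format number + 4-byte CRC slot.
# Both formats use the same header size; the old one leaves the slot unused.
HEADER = struct.Struct("<HI")
OLD_FORMAT = 4   # illustrative number for the existing format
CRC_FORMAT = 5   # illustrative number for the proposed format


def page_init(payload: bytes, write_page_crc: bool) -> bytes:
    """Build a page; store a CRC only when the (hypothetical) GUC is on."""
    body_size = PAGE_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\0")
    if write_page_crc:
        return HEADER.pack(CRC_FORMAT, zlib.crc32(body)) + body
    return HEADER.pack(OLD_FORMAT, 0) + body


def page_header_is_valid(page: bytes) -> bool:
    """On read: verify the CRC if the page format carries one."""
    if len(page) != PAGE_SIZE:
        return False
    version, stored_crc = HEADER.unpack_from(page)
    body = page[HEADER.size:]
    if version == OLD_FORMAT:
        return True                      # old format: nothing to verify
    if version == CRC_FORMAT:
        return zlib.crc32(body) == stored_crc
    return False                         # unknown format number
```

Note that in this sketch only the two functions look at the format number; everything else treats the page as opaque, which is the point of keeping the headers the same size.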
I see some big advantages to this over what I suggested to David.
For starters, using a flag bit and putting the CRC somewhere other
than the page header would require that each AM deal with the CRC,
exposing some function(s) for that. Simon's idea doesn't require
that. I was also a bit concerned about shifting tuple images to
convert non-protected pages to protected pages. No need to do that,
either. With the bit flags, I think there might be some cases where
we would be unable to add a CRC to a converted page because space was
too tight; that's not an issue with Simon's proposal.
Heikki was talking about a pre-convert tool. Neither approach really
needs that, although with Simon's approach it would be possible to
have a background *post*-conversion tool to add CRCs, if desired.
Things would continue to function if it wasn't run; you just wouldn't
have CRC protection on pages not updated since pg_upgrade was run.
Simon, does it sound like I understand your proposal?
Now, on to the separate-but-related topic of double-write. That
absolutely requires some form of checksum or CRC to detect torn
pages, in order for the technique to work at all. Adding a CRC
without double-write would work fine if you have a storage stack
which prevents torn pages in the file system or hardware driver. If
you don't have that, a torn write could cause a page to be reported
as damaged after a hardware or OS crash, although I suspect that
would be the exception, not the typical case. Given all that, and the
fact that
it would be cleaner to deal with these as two separate patches, it
seems the CRC patch should go in first. (And, if this is headed for
9.2, *very soon*, so there is time for the double-write patch to
follow.)
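As a rough illustration of why a CRC detects a torn page, here is a small Python simulation. The 512-byte atomic write unit, the 4-byte CRC prefix, and the helper names are assumptions for the sketch, not anything from the PostgreSQL source.

```python
import zlib

PAGE_SIZE = 8192
SECTOR_SIZE = 512  # assumed unit that the storage writes atomically


def with_crc(body: bytes) -> bytes:
    """Prefix a page body with its CRC-32 (toy layout)."""
    return zlib.crc32(body).to_bytes(4, "little") + body


def crc_ok(page: bytes) -> bool:
    """True if the stored CRC matches the rest of the page."""
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Image left on disk if a crash hits after some sectors of `new`."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


old_page = with_crc(b"A" * (PAGE_SIZE - 4))
new_page = with_crc(b"B" * (PAGE_SIZE - 4))
torn_page = torn_write(old_page, new_page, sectors_written=3)
```

Here crc_ok() holds for old_page and new_page but not for torn_page, which carries the new CRC over a mix of new and old sectors. Without double-write (or a full-page image in WAL) there is nothing to repair such a page from, so all the CRC can do is report it.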
It seems to me that the full_page_writes GUC could become an
enumeration, with "off" having the current meaning, "wal" meaning
what "on" now does, and "double" meaning that the new double-write
technique would be used. (It doesn't seem to make any sense to do
both at the same time.) I don't think we need a separate GUC to tell
us *what* to protect against torn pages -- if not "off" we should
always protect the first write of a page after checkpoint, and if
"double" and write_page_crc (or whatever we call it) is "on", then we
protect hint-bit-only writes. I think. I can see room to argue that
with CRCs on we should do a full-page write to the WAL for a
hint-bit-only change, or that we should add another GUC to control
when we do this.
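One literal reading of the rules above, sketched in Python as a decision function. The enum, the write_page_crc parameter, and the function name are hypothetical, and the open question about WAL-logging a full page for hint-bit-only changes is deliberately not modeled.

```python
from enum import Enum


class FullPageWrites(Enum):
    OFF = "off"        # current meaning of "off"
    WAL = "wal"        # what "on" does now
    DOUBLE = "double"  # the new double-write technique


def torn_page_protection(fpw: FullPageWrites,
                         write_page_crc: bool,
                         first_write_since_checkpoint: bool,
                         hint_bit_only: bool) -> str:
    """Return "none", "wal", or "double" for one page write."""
    if fpw is FullPageWrites.OFF:
        return "none"
    if first_write_since_checkpoint:
        # Always protected when the setting is not "off".
        return fpw.value
    if fpw is FullPageWrites.DOUBLE and write_page_crc and hint_bit_only:
        # With CRCs on, a torn hint-bit-only write would fail the CRC check.
        return "double"
    return "none"
```

For example, torn_page_protection(FullPageWrites.DOUBLE, True, False, True) returns "double", while the same call with FullPageWrites.WAL returns "none".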
I'm going to take a shot at writing a patch for background hinting
over the holidays, which I think has benefit on its own but also
boosts the value of these patches, since it would reduce the
double-write activity otherwise needed to prevent spurious errors
when using CRCs.
This whole area has some overlap with spreading writes, I think. The
double-write approach seems to count on writing a bunch of pages
(potentially from different disk files) sequentially to the
double-write buffer, fsyncing that, and then writing the actual pages
-- which must be fsynced before the related portion of the
double-write buffer can be reused. The simplest implementation would
be to fsync the files just written if they required a prior write to
the double-write buffer, although fancier techniques could be used to
try to optimize that. Again, setting hint bits before the write when
possible would help reduce the impact of that.
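The ordering described above can be sketched in Python as follows. This is only my reading of the simple implementation, not the actual double-write patch; the function name, the single-file buffer, and the (path, page number, bytes) batch format are assumptions, and crash recovery from the buffer is left out.

```python
import os

PAGE_SIZE = 8192


def double_write_batch(dw_path, pages):
    """Write a batch of pages via a double-write buffer.

    pages: list of (data_file_path, page_number, page_bytes); the data
    files must already exist. Returns the data files that were fsynced.
    """
    # 1. Write the whole batch sequentially to the double-write buffer.
    with open(dw_path, "wb") as dw:
        for _path, _pageno, data in pages:
            dw.write(data)
        dw.flush()
        # 2. The buffer must be durable before any real page is touched.
        os.fsync(dw.fileno())

    open_files = {}
    try:
        # 3. Write each page to its real location.
        for path, pageno, data in pages:
            if path not in open_files:
                open_files[path] = open(path, "r+b")
            f = open_files[path]
            f.seek(pageno * PAGE_SIZE)
            f.write(data)
        # 4. Simple approach: fsync every file just written.
        for f in open_files.values():
            f.flush()
            os.fsync(f.fileno())
    finally:
        for f in open_files.values():
            f.close()
    # Only now may this portion of the double-write buffer be reused.
    return sorted(open_files)
```

Step 4 is where this overlaps with spreading writes: every batch forces an fsync of each data file it touched, so fewer double-written pages means fewer forced fsyncs.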
-Kevin