Re: CRCs (was: beta testing version)

From: Bruce Guenter <bruceg(at)em(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: CRCs (was: beta testing version)
Date: 2000-12-07 00:53:37
Message-ID: 20001206185337.A24108@em.ca
Lists: pgsql-general pgsql-hackers

On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote:
> On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
> > On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
> > > How exactly *do* we determine where the end of the valid log data is,
> > > anyway?
> >
> > I don't know how pgsql does it, but the only safe way I know of is to
> > include an "end" marker after each record. When writing to the log,
> > append the records after the last end marker, ending with another end
> > marker, and fdatasync the log. Then overwrite the previous end marker
> > to indicate it's not the end of the log any more and fdatasync again.
> >
> > To ensure that it is written atomically, the end marker must not cross a
> > hardware sector boundary (typically 512 bytes). This can be trivially
> > guaranteed by making the marker a single byte.
>
> An "end" marker is not sufficient, unless all writes are done in
> one-sector units with an fsync between, and the drive buffering
> is turned off.

That's why an end marker must follow all valid records. When you write
records, you don't touch the existing end marker; you append the new
records after it, followed by a new end marker. After writing and syncing
the records, you rewrite the old end marker to indicate that the data
following it is valid, and sync again. There is no state in that sequence
in which partially written data could be mistaken for real data, assuming
that either your drives aren't doing write-back caching or you have a UPS,
and that fsync doesn't return until the drives report success.
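
The two-sync sequence described above could be sketched roughly as follows.
This is only an illustration in Python, not PostgreSQL's log code: the
record layout (one marker byte, a 4-byte length, then the payload), the
marker values, and the names create_log, read_log, and append_records are
all assumptions made for the example. The crash_before_commit flag is
likewise hypothetical and simply stops after the first sync to imitate a
crash.

```python
import os
import struct

END = b"\x00"   # assumed marker value: nothing valid follows
MORE = b"\x01"  # assumed marker value: a valid record follows

def create_log(path):
    """Create an empty log that consists of a single end marker."""
    with open(path, "wb") as f:
        f.write(END)
        f.flush()
        os.fsync(f.fileno())

def read_log(path):
    """Return (list of valid payloads, offset of the current end marker)."""
    records = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if f.read(1) != MORE:
                return records, pos
            (length,) = struct.unpack("<I", f.read(4))
            records.append(f.read(length))

def append_records(path, payloads, crash_before_commit=False):
    """Append payloads using the write, sync, flip marker, sync sequence."""
    if not payloads:
        return
    _, end_pos = read_log(path)
    # New records, separated by MORE markers and closed by a new end marker.
    body = MORE.join(struct.pack("<I", len(p)) + p for p in payloads) + END
    with open(path, "r+b") as f:
        # Step 1: write after the old end marker without touching it.
        f.seek(end_pos + 1)
        f.write(body)
        f.flush()
        os.fsync(f.fileno())
        if crash_before_commit:
            return  # simulated crash: the old end marker still hides the data
        # Step 2: flip the old one-byte end marker, then sync again.
        f.seek(end_pos)
        f.write(MORE)
        f.flush()
        os.fsync(f.fileno())
```

If the process stops between the two steps, read_log still stops at the old
end marker, so the half-committed records are never returned. The sketch
inherits the assumptions stated above: os.fsync must not return before the
data is really on stable storage.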

> For larger writes the OS will re-order the writes.
> Most drives will re-order them too, even if the OS doesn't.

I'm well aware of that.

> > Any other way I've seen discussed (here and elsewhere) either
> > - Assume that a CRC is a guarantee.
>
> We are already assuming a CRC is a guarantee.
>
> The drive computes a CRC for each sector, and if the CRC is OK the
> drive is happy. CRC errors within the drive are quite frequent, and
> the drive re-reads when a bad CRC comes up.

The kinds of data failures that a CRC is guaranteed to catch (N-bit
errors) are almost precisely those that a misread of a hardware sector
would cause.
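
As a small illustration of the "guaranteed to catch" case, the sketch below
flips every single bit of a 512-byte buffer in turn and checks that the
CRC changes each time. It uses zlib.crc32 as a stand-in; the checksum a
drive actually uses is not specified in this thread, and the function name
and sample buffer are made up for the example.

```python
import zlib

SECTOR_SIZE = 512  # bytes, the typical sector size mentioned earlier

def undetected_single_bit_flips(sector):
    """Count the single-bit flips that leave the CRC-32 unchanged."""
    good = zlib.crc32(sector)
    buf = bytearray(sector)
    missed = 0
    for bit in range(len(buf) * 8):
        byte, mask = bit // 8, 1 << (bit % 8)
        buf[byte] ^= mask              # corrupt exactly one bit
        if zlib.crc32(bytes(buf)) == good:
            missed += 1
        buf[byte] ^= mask              # restore the bit
    return missed

# Arbitrary sample contents; any 512-byte buffer would do.
sector = bytes(i % 251 for i in range(SECTOR_SIZE))
```

Calling undetected_single_bit_flips(sector) returns 0, since CRC-32
detects every single-bit error.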

> > ... A CRC would be a good addition to
> > help ensure the data wasn't broken by flakey drive firmware, but
> > doesn't guarantee consistency.
>
> No, a CRC would be a good addition to compensate for sector write
> reordering, which is done both by the OS and by the drive, even for
> "atomic" writes.

But it doesn't guarantee consistency, even in that case. There is a
possibility (however small) that the random data that was located in the
sectors before the write will match the CRC.
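
To put a rough number on "however small": if the stale bytes are assumed to
be uniformly random (real leftover data usually is not), an n-bit check
matches by accident with probability 2^-n, which is about 1 in 4.3 billion
for 32 bits. The sketch below states that formula and checks it empirically
with the CRC truncated to 8 bits so that matches actually occur. The
function names, the 16-byte payload, and the fixed seed are assumptions for
the example.

```python
import random
import zlib

def false_match_probability(bits):
    """Chance that uniformly random data matches a random n-bit check."""
    return 2.0 ** -bits

def count_false_matches(trials, check_bits=8, seed=0):
    """Count random 'stale' records whose stored check happens to match."""
    rng = random.Random(seed)
    mask = (1 << check_bits) - 1
    hits = 0
    for _ in range(trials):
        payload = bytes(rng.getrandbits(8) for _ in range(16))
        stored_check = rng.getrandbits(check_bits)  # leftover "CRC" field
        if zlib.crc32(payload) & mask == stored_check:
            hits += 1
    return hits
```

With an 8-bit check, roughly one random record in 256 passes. With a full
32-bit CRC the same accident is far rarer, but it is still not impossible.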
--
Bruce Guenter <bruceg(at)em(dot)ca> http://em.ca/~bruceg/
