From: | "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | Curt Sampson <cjs(at)cynic(dot)net>, Michael Loftis <mloftis(at)wgops(dot)com>, mlw <markw(at)mohawksoft(dot)com>, PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Index Scans become Seq Scans after VACUUM ANALYSE |
Date: | 2002-06-22 22:22:58 |
Message-ID: | 1024784514.1793.242.camel@localhost.localdomain |
Lists: | pgsql-hackers |
On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
> I was wondering, how does knowing the block is corrupt help MS SQL?
> Right now, we write changed pages to WAL, then later write them to disk.
> I have always been looking for a way to prevent these WAL writes. The
> 512-byte bit seems interesting, but how does it help?
>
> And how does the bit help them with partial block writes? Is the bit at
> the end of the block? Is that reliable?
>
My understanding of this is as follows:
1) On most commercial systems, if you get a corrupted block (from
partial write or whatever) you need to restore the file(s) from the most
recent backup, and replay the log from the log archive (usually only the
damaged files will be written to during replay).
2) If you can't deal with the downtime to recover the file, then EMC,
Sun, or IBM will sell you an expensive disk array with an NVRAM cache
that will do atomic writes. Some plain-vanilla SCSI disks are also
capable of atomic writes, though usually they don't use NVRAM to do it.
The database must then make sure that each page-write gets translated
into exactly one SCSI-level write. This is one reason why ORACLE and
Sybase recommend that you use raw disk partitions for high availability.
Some operating systems support this through the filesystem, but it is OS
dependent. I think Solaris 7 & 8 have support for this, but I'm not sure.
PostgreSQL has trouble because it can neither archive logs for replay,
nor use raw disk partitions.
One other point:
Page pre-image logging is fundamentally the same as what Jim Gray's
book[1] would call "careful writes". I don't believe they should be in
the XLOG, because we never need to keep the pre-images after we're sure
the buffer has made it to the disk. Instead, we should have the buffer
IO routines implement ping-pong writes of some kind if we want
protection from partial writes.
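To make the ping-pong idea concrete, here is a minimal sketch in Python. It
is not PostgreSQL code and not taken from the book; the class name
PingPongPage, the 8192-byte page size, the header layout, and the torn_after
parameter are all assumptions made up for illustration. Each logical page
gets two slots, and a write always goes to the slot that does not hold the
newest valid copy, so a torn write can only damage the copy being replaced:

```python
import struct
import zlib

PAGE_SIZE = 8192                 # assumed page size, for illustration only
HEADER = struct.Struct("<QI")    # hypothetical header: sequence number, CRC32 of payload


class PingPongPage:
    """One logical page backed by two slots; writes alternate between them."""

    def __init__(self):
        # Two in-memory bytearrays stand in for two on-disk locations.
        self.slots = [bytearray(HEADER.size + PAGE_SIZE) for _ in range(2)]

    def _newest(self):
        """Return (slot index, sequence, payload) of the newest valid slot, or None."""
        best = None
        for i, slot in enumerate(self.slots):
            seq, crc = HEADER.unpack_from(slot, 0)
            payload = bytes(slot[HEADER.size:])
            # Sequence 0 means "never written"; a CRC mismatch means a torn write.
            if seq == 0 or zlib.crc32(payload) != crc:
                continue
            if best is None or seq > best[1]:
                best = (i, seq, payload)
        return best

    def write(self, payload, torn_after=None):
        """Write a new page image into the slot that is NOT the newest valid one.

        torn_after simulates a partial write: only that many bytes of the
        new image reach the slot before the "crash".
        """
        if len(payload) != PAGE_SIZE:
            raise ValueError("payload must be exactly one page")
        newest = self._newest()
        target = 0 if newest is None else 1 - newest[0]
        seq = 1 if newest is None else newest[1] + 1
        image = HEADER.pack(seq, zlib.crc32(payload)) + payload
        n = len(image) if torn_after is None else torn_after
        self.slots[target][:n] = image[:n]

    def read(self):
        """Return the newest intact page image, or None if none was ever written."""
        newest = self._newest()
        return None if newest is None else newest[2]
```

If a write is torn, read() falls back to the previous image, which is the
only property needed here: the pre-image lives in the other slot instead of
in the XLOG, and it can be overwritten as soon as the newer copy is known
to be on disk.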
Does any of this make sense?
;jrnield
[1] Gray, J. and Reuter, A. (1993). "Transaction Processing: Concepts
and Techniques". Morgan Kaufmann.
--
J. R. Nield
jrnield(at)usol(dot)com