From: | Curt Sampson <cjs(at)cynic(dot)net> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | Michael Loftis <mloftis(at)wgops(dot)com>, mlw <markw(at)mohawksoft(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Index Scans become Seq Scans after VACUUM ANALYSE |
Date: | 2002-06-21 02:18:14 |
Message-ID: | Pine.NEB.4.43.0206211106390.437-100000@angelic.cynic.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, 20 Jun 2002, Bruce Momjian wrote:
> > MS SQL Server has an interesting way of dealing with this. They have a
> > "torn" bit in each 512-byte chunk of a page, and this bit is set the
> > same for each chunk. When they are about to write out a page, they first
> > flip all of the torn bits and then do the write. If the write does not
> > complete due to a system crash or whatever, this can be detected later
> > because the torn bits won't match across the entire page.
>
> I was wondering, how does knowing the block is corrupt help MS SQL?
I'm trying to recall, but I can't off hand. I'll have to look it
up in my Inside SQL Server book, which is at home right now,
unfortunately. I'll bring the book into work and let you know the
details later.
> Right now, we write changed pages to WAL, then later write them to disk.
Ah. You write the entire page? MS writes only the changed tuple.
And DB2, in fact, goes one better and writes only the part of the
tuple up to the change, IIRC. Thus, if you put smaller and/or more
frequently changed columns first, you'll have smaller logs.
> I have always been looking for a way to prevent these WAL writes. The
> 512-byte bit seems interesting, but how does it help?
Well, this would at least let you reduce the write to the 512-byte
chunk that changed, rather than writing the entire 8K page.
> And how does the bit help them with partial block writes? Is the bit at
> the end of the block? Is that reliable?
The bit is somewhere within every 512 byte "disk page" within the
8192 byte "filesystem/database page." So an 8KB page is divided up
like this:
| <----------------------- 8 Kb ----------------------> |
| 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |
Thus, the tear bits start out like this:
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
After a successful write of the entire page, you have this:
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
If the write is unsuccessful, you end up with something like this:
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
And now you know which parts of your page got written, and which
parts didn't.
cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
From | Date | Subject | |
---|---|---|---|
Next Message | David Ford | 2002-06-21 02:23:29 | Re: cvs read lock |
Previous Message | Rod Taylor | 2002-06-21 02:08:40 | Domain coercion |