From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Checkpoint cost, looks like it is WAL/CRC |
Date: | 2005-07-08 09:17:51 |
Message-ID: | 1120814272.3940.299.camel@localhost.localdomain |
Lists: | pgsql-hackers |
On Thu, 2005-07-07 at 11:59 -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > Tom Lane wrote:
> > >> The point here is that fsync-off is only realistic for development
> > >> or playpen installations. You don't turn it off in a production
> > >> machine, and I can't see that you'd turn off the full-page-write
> > >> option either. So we have not solved anyone's performance problem.
> >
> > > Yes, this is basically another fsync-like option that isn't for
> > > production usage in most cases. Sad but true.
> >
> > Just to make my position perfectly clear: I don't want to see this
> > option shipped in 8.1. It's reasonable to have it in there for now
> > as an aid to our performance investigations, but I don't see that it
> > has any value for production.
>
> Well, this is the first I am hearing that, and of course your position
> is just one vote.
>
> One idea would be to just tie its behavior directly to fsync and remove
> the option completely (that was the original TODO), or we can adjust it
> so it doesn't have the same risks as fsync, or the same lack of failure
> reporting as fsync.
I second Tom's objection, until we have agreed on at least one of:
- a conclusive physical test that shows that specific hardware *never*
causes torn pages
- a national/international standard name/number that everybody can quote
when asking their manufacturer whether or not they comply (I doubt that
exists...)
- a conclusive check for torn pages that can be added to the recovery
code to show whether or not they have occurred.
Is there also a potential showstopper in the redo machinery? We work on
the assumption that the post-checkpoint block is available in WAL as a
before image. Redo for every action merely replays the write action
onto that block. Without that image, we must reapply the write action
onto the block as found on disk, so the redo machinery must check whether
the write action has already been successfully applied before it decides
to redo. I'm not sure that the current code does that.
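
A minimal sketch of the kind of "already applied?" check meant above, written in Python for brevity rather than in the backend's C. The Page and WalRecord types and the redo() helper are hypothetical and stand in for the real structures; the only idea illustrated is comparing the page's LSN against the record's LSN before replaying.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                        # LSN of the last WAL record applied
    rows: list = field(default_factory=list)


@dataclass
class WalRecord:
    lsn: int
    row: str                            # hypothetical payload: one row to append


def redo(page: Page, record: WalRecord) -> bool:
    """Replay record onto page unless the page already reflects it.

    Returns True if the record was applied, False if it was skipped.
    """
    if page.lsn >= record.lsn:
        return False                    # change is already on the page
    page.rows.append(record.row)
    page.lsn = record.lsn
    return True


page = Page()
record = WalRecord(lsn=10, row="a")
first = redo(page, record)              # applied
second = redo(page, record)             # skipped, page.lsn is already 10
```

Note that a comparison like this trusts the LSN stored on the page, so on its own it says nothing about whether the rest of the page was torn.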
Having raised that objection, ISTM that checking for torn pages can be
accomplished reasonably well using a few rules... These are simple
because we do not update in place for MVCC.
Since inserts and vacuums alter pd_upper and pd_lower, we should be
able to do a self-consistency check that shows that all items are
correctly placed. If there is non-zero data higher than the pd_upper
pointer, then we know that the first sector is torn. If a pointer
doesn't match with a row version, then the page is torn.
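
To make those rules concrete, here is a rough sketch against a toy page layout, not the real one. Everything in it is an assumption for illustration: the 4-byte header holding only pd_lower and pd_upper, the (offset, length) item pointers, the 2-byte length prefix on each row, and the premise that free space between pd_lower and pd_upper is zero-filled.

```python
import struct

PAGE_SIZE = 8192
SECTOR = 512
HEADER = struct.Struct("<HH")   # toy header: pd_lower, pd_upper
ITEM = struct.Struct("<HH")     # toy item pointer: row offset, row length
ROWLEN = struct.Struct("<H")    # toy row header: the row's own total length


def build_page(payloads):
    """Build a toy page: item pointers grow upward, rows grow downward."""
    page = bytearray(PAGE_SIZE)
    lower, upper = HEADER.size, PAGE_SIZE
    for payload in payloads:
        length = ROWLEN.size + len(payload)
        upper -= length
        ROWLEN.pack_into(page, upper, length)
        page[upper + ROWLEN.size:upper + length] = payload
        ITEM.pack_into(page, lower, upper, length)
        lower += ITEM.size
    HEADER.pack_into(page, 0, lower, upper)
    return bytes(page)


def looks_torn(page):
    """Return True if the toy page fails the self-consistency rules."""
    lower, upper = HEADER.unpack_from(page, 0)
    if not (HEADER.size <= lower <= upper <= PAGE_SIZE):
        return True
    if (lower - HEADER.size) % ITEM.size:
        return True
    if any(page[lower:upper]):
        return True                 # data where the header claims free space
    expected_end = PAGE_SIZE
    for pos in range(HEADER.size, lower, ITEM.size):
        off, length = ITEM.unpack_from(page, pos)
        if length < ROWLEN.size or off < upper or off + length != expected_end:
            return True             # item is not correctly placed
        if ROWLEN.unpack_from(page, off)[0] != length:
            return True             # pointer does not match a row version
        expected_end = off
    return expected_end != upper    # lowest row must start at pd_upper


old = build_page([b"alpha"])
new = build_page([b"alpha", b"bravo"])
new_head_old_tail = new[:SECTOR] + old[SECTOR:]   # only first sector written
old_head_new_tail = old[:SECTOR] + new[SECTOR:]   # first sector not written
```

In this toy layout both torn variants are caught: the first because the new item pointer points at a row that is not there, the second because row data sits in space the old header says is free.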
It is possible that the first sector of a page could be undetectably
torn if it was nearly full and the item pointer pointed to the first
sector. However, for every page touched, the last WAL record to touch
that page should have an LSN that matches the database page. In most
cases they would match, proving the page was not torn. If they did not
match we would have no proof either way, so we would be advised to act
as if the page were torn for that situation. Possibly, we could
reinstate the idea of putting the LSN at the beginning and end of every
page, since that would help prove the first sector (only) was not torn.
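
A small sketch of how the LSN comparison and the LSN-at-both-ends idea could combine, again on a toy layout. The 8-byte LSN copies at the very start and very end of the page, the stamp() and check_page() helpers, and the result strings are all hypothetical.

```python
import struct

PAGE_SIZE = 8192
SECTOR = 512
LSN = struct.Struct("<Q")           # toy 8-byte LSN field


def stamp(page: bytearray, lsn: int) -> None:
    """Write the same LSN at the beginning and the end of the page."""
    LSN.pack_into(page, 0, lsn)
    LSN.pack_into(page, PAGE_SIZE - LSN.size, lsn)


def check_page(page, last_wal_lsn: int) -> str:
    """Classify a page given the LSN of the last WAL record that touched it."""
    head = LSN.unpack_from(page, 0)[0]
    tail = LSN.unpack_from(page, PAGE_SIZE - LSN.size)[0]
    if head != tail:
        return "torn"               # first and last sectors disagree
    if head == last_wal_lsn:
        return "lsn matches"        # page reflects the last logged change
    return "no proof, treat as torn"


old = bytearray(PAGE_SIZE)
stamp(old, 41)
new = bytearray(PAGE_SIZE)
stamp(new, 42)
torn = new[:SECTOR] + old[SECTOR:]  # only the first sector reached disk
```

With these pages, check_page(new, 42) reports a match, check_page(torn, 42) reports a torn page, and check_page(old, 42) falls into the "no proof" case described above.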
It is possible that a page could be torn and yet still be consistent,
but this could only occur for a delete. Reapplying the delete, whether
or not it is visible on the page, would overcome that without problem.
It is possible that one or more sectors of empty space in the
middle of a block could be torn, but their contents would still be
identical, so this is irrelevant and can be ignored.
Best Regards, Simon Riggs