Robert Haas wrote:
> Jeff Janes wrote:
>> But it doesn't seem safe to me to replace a page from the DW buffer
>> and then apply WAL to that replaced page which preceded the age of
>> the page in the buffer.
>
> That's what LSNs are for.
Agreed.
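To illustrate the LSN check, here is a minimal sketch, not PostgreSQL's actual recovery code. The names `Page`, `WalRecord`, and `recover` are hypothetical, and a WAL record is simplified to a whole new page image. It restores the double-written copies first and then replays WAL, skipping any record whose LSN is not newer than the LSN already stamped on the page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int   # LSN of the last WAL record reflected in this page
    data: str


@dataclass
class WalRecord:
    lsn: int
    page_id: int
    new_data: str  # simplification: each record carries the full new contents


def recover(disk, dw_buffer, wal):
    """Restore double-written pages, then replay WAL guarded by page LSNs.

    Returns the list of LSNs that were actually applied.
    """
    # Step 1: overwrite possibly torn pages with their intact double-written copies.
    for page_id, saved in dw_buffer.items():
        disk[page_id] = Page(saved.lsn, saved.data)

    # Step 2: replay in LSN order; a record older than (or equal to) the
    # page's LSN is already reflected in the page and must be skipped.
    applied = []
    for rec in sorted(wal, key=lambda r: r.lsn):
        page = disk[rec.page_id]
        if rec.lsn <= page.lsn:
            continue
        page.lsn, page.data = rec.lsn, rec.new_data
        applied.append(rec.lsn)
    return applied
```

With a restored page at LSN 20 and WAL records at LSNs 10, 20, and 30 for that page, only the record at LSN 30 is applied, so WAL that precedes the restored page is never applied on top of it.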
> If we write the page to the checkpoint buffer just once per
> checkpoint, recovery can restore the double-written versions of the
> pages and then begin WAL replay, which will restore all the
> subsequent changes made to the page. Recovery may also need to do
> additional double-writes if it encounters pages for which we
> wrote WAL but never flushed the buffer, because a crash during
> recovery can also create torn pages.
That's a good point. I think WAL application does need to use
double-write. As usual, it doesn't affect *when* a page must be
written, but *how*.
> When we reach a restartpoint, we fsync everything down to disk and
> then nuke the double-write buffer.
I think we add to the double-write buffer as we write pages from the
buffer cache to disk. I don't think it makes sense to do potentially
repeated writes of the same page with different contents to the
double-write buffer as we go; nor is it a good idea to leave the page
unsynced and let the double-write buffer grow for a long time.
> Similarly, in normal running, we can nuke the double-write buffer
> at checkpoint time, once the fsyncs are complete.
Well, we should nuke it for re-use as soon as all pages in the buffer
are written and fsynced. I'm not at all sure that waiting for the
checkpoint to do that gives better performance than doing it at page
eviction time.
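A rough sketch of that reuse rule, under the assumption that we track per page whether its home location has been fsynced. The class `DoubleWriteBuffer` and its methods are hypothetical names for illustration only. The buffer becomes reusable the moment every page in it is safely on disk, with no dependence on checkpoint timing.

```python
class DoubleWriteBuffer:
    """Tracks which double-written pages still await an fsync at their home location."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = {}  # page_id -> True once the home location is fsynced

    def add(self, page_id):
        # Re-adding a page that is already in the buffer does not use a new slot.
        if page_id not in self.pending and len(self.pending) >= self.capacity:
            raise RuntimeError("double-write buffer full; sync pages before adding")
        self.pending[page_id] = False

    def mark_synced(self, page_id):
        self.pending[page_id] = True

    def try_recycle(self):
        """Clear the buffer for re-use if every page in it is written and fsynced."""
        if self.pending and all(self.pending.values()):
            self.pending.clear()
            return True
        return False
```

Calling `try_recycle()` after each page sync keeps the buffer small, which is the point of not waiting for the checkpoint.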
The whole reason that double-write techniques don't double the write
time is that the double-write buffer is relatively small, so the
multiple writes to the same disk sectors get absorbed by the BBU
write-back cache without actually hitting the disk all the time.
Letting the double-write buffer grow to a large size seems likely to
me to be a performance killer. The whole double-write, including the
fsyncs to the buffer and to the actual page location, should just be
considered part of the page write process, I think.
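As a sketch of treating the double-write as one page write operation: the function name `double_write_page`, the slot layout, and the 8192-byte page size are assumptions for illustration, and `os.pwrite` is only available on Unix-like systems. The page is written and fsynced in the double-write area first, so an intact copy exists before the home location is touched.

```python
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


def double_write_page(dw_fd, data_fd, page_no, page, dw_slot=0):
    """Write one page via the double-write area, then to its home location."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")

    # 1. Write the page to the double-write area and force it to stable storage.
    if os.pwrite(dw_fd, page, dw_slot * PAGE_SIZE) != PAGE_SIZE:
        raise OSError("short write to double-write area")
    os.fsync(dw_fd)

    # 2. Only now overwrite the home location; if this write is torn by a
    #    crash, the copy from step 1 can be used to repair it.
    if os.pwrite(data_fd, page, page_no * PAGE_SIZE) != PAGE_SIZE:
        raise OSError("short write to data file")
    os.fsync(data_fd)
```

Issuing both fsyncs per page is the simplest form; a real implementation would presumably batch several pages per fsync to keep the cost down.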
-Kevin