Re: Background writer process

From: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Background writer process
Date: 2003-11-18 06:59:39
Message-ID: 3FB9C35B.3000201@myrealbox.com
Lists: pgsql-hackers

Bruce Momjian wrote:

> Shridhar Daithankar wrote:
>
>>On Friday 14 November 2003 22:10, Jan Wieck wrote:
>>
>>>Shridhar Daithankar wrote:
>>>
>>>>On Friday 14 November 2003 03:05, Jan Wieck wrote:
>>>>
>>>>>For sure the sync() needs to be replaced by the discussed fsync() of
>>>>>recently written files. And I think the algorithm how much and how often
>>>>>to flush can be significantly improved. But after all, this does not
>>>>>change the real checkpointing at all, and the general framework having a
>>>>>separate process is what we probably want.
>>>>
>>>>Would having fsync for regular data files and sync for WAL segments be a
>>>>comfortable compromise? Or is this going to use fsync for all of them?
>>>>
>>>>IMO, with fsync, we tell the kernel that it can write this buffer. It may or
>>>>may not write it immediately, unless it is a hard sync.
>>>
>>>I think it's more the other way around. On some systems sync() might
>>>return before all buffers are flushed to disk, while fsync() does not.
>>
>>Oops.. that's bad.
>
>
> Yes, one idea I had was to do an fsync on a new file _after_ issuing
> sync, hoping that this will complete after all the sync buffers are
> done.
>
>
>>>>Since postgresql can afford lazy writes for data files, I think this
>>>>could work.
>>>
>>>The whole point of a checkpoint is to know for certain that a specific
>>>change is in the datafile, so that it is safe to throw away older WAL
>>>segments.
>>
>>I just made another posting on patches for a thread crossing win32-devel.
>>
>>Essentially I said
>>
>>1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC. (Not sure if the current code
>>does it. The hackery in xlog.c is not exactly trivial.)
>
>
> We write WAL, then fsync, so if we write multiple blocks, we can write
> them and fsync once, rather than O_SYNC every write.
>
>
>>2. Open data files normally and fsync them only in background writer process.
>>
>>Now the BGWriter process will flush everything at checkpoint time. It does not
>>need to flush WAL because of O_SYNC (ideally, but an additional fsync won't
>>hurt). So it just flushes all the file descriptors touched since the last
>>checkpoint, which should not be much of a load because it is flushing those
>>files intermittently anyway.
>>
>>It could also work nicely if only the background writer fsyncs the data files.
>>Backends can either wait or proceed to other business while the disk is being
>>flushed. Backends certainly need to wait while committing, and that should be
>>a rather small delay for syncing to disk in the current process as opposed to
>>in the background process.
>>
>>In the case of a commit, BGWriter could get away with flushing the files
>>touched in the transaction + WAL, as opposed to all files touched since the
>>last checkpoint + WAL in the case of a checkpoint. I don't know how difficult
>>that would be.
>>
>>What is different in current BGwriter implementation? Use of sync()?
>
>
> Well, basically we are still discussing how to do this. Right now the
> backend writer patch uses sync(), but the final version will use fsync
> or O_SYNC, or maybe nothing.
>
> The open item is whether a background process can keep the dirty
> buffers cleaned fast enough to keep up with the maximum number of
> backends. We might need to use multiple processes or threads to do
> this. We certainly will have a background writer in 7.5 --- the big
> question is whether _all_ writes will go through it. It certainly would
> be nice if they could, and Tom thinks they can, so we are still exploring
> this.

Given that fsync blocks, the background writer has to scale its number of
processes/threads with the disk-flushing load.

I would vote for threads, for the simple reason that in the BGWriter, threads
are needed only to flush files: get the fd, fsync it, and get the next one.
There is no need to make the entire process thread-safe.
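
To make the "get the fd, fsync it and get next one" loop concrete, here is a
rough sketch using POSIX threads. The names FlushQueue, flush_worker and
flush_files are made up for illustration and are not from any patch; the only
shared state is a cursor into the fd list, which is why nothing else would
need to be thread-safe.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical work queue: a list of fds plus a shared cursor. */
typedef struct {
    const int *fds;
    int nfds;
    int next;               /* index of the next fd to hand out */
    int failures;           /* number of fsync() calls that failed */
    pthread_mutex_t lock;   /* protects next and failures */
} FlushQueue;

/* Worker: take the next fd, fsync it, repeat until the list is empty. */
static void *flush_worker(void *arg)
{
    FlushQueue *q = arg;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        int i = (q->next < q->nfds) ? q->next++ : -1;
        pthread_mutex_unlock(&q->lock);
        if (i < 0)
            break;
        if (fsync(q->fds[i]) != 0) {
            pthread_mutex_lock(&q->lock);
            q->failures++;
            pthread_mutex_unlock(&q->lock);
        }
    }
    return NULL;
}

/*
 * Flush all fds with up to nthreads workers.
 * Returns the number of failed fsyncs, or -1 if no thread could be started.
 */
int flush_files(const int *fds, int nfds, int nthreads)
{
    enum { MAX_THREADS = 16 };   /* arbitrary cap for the sketch */
    pthread_t tids[MAX_THREADS];
    FlushQueue q;
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    q.fds = fds;
    q.nfds = nfds;
    q.next = 0;
    q.failures = 0;
    pthread_mutex_init(&q.lock, NULL);

    while (started < nthreads &&
           pthread_create(&tids[started], NULL, flush_worker, &q) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&q.lock);
    return (started == 0) ? -1 : q.failures;
}
```

If fewer threads start than requested, the ones that did start still drain the
whole list, so the flush only gets slower, not incomplete.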

Furthermore, the BGWriter has to detect the disk limit. If adding threads does
not improve fsyncing speed, it should stop adding them and wait. There is
nothing more to do once the disk is saturated.
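
One possible shape for that check, purely as a sketch: compare the flush rate
measured before and after the last thread was added, and only grow while the
gain is worthwhile. The function name next_thread_count and the 5% MIN_GAIN
threshold are my own assumptions, not anything measured or agreed on.

```c
#include <assert.h>

/* Assumed policy knob: minimum relative gain that justifies another thread. */
#define MIN_GAIN 0.05

/*
 * prev_rate: flush rate (e.g. files fsynced per second) before the last
 *            thread was added.
 * cur_rate:  flush rate with the current number of threads.
 * Returns the thread count to use for the next round.
 */
int next_thread_count(int cur_threads, int max_threads,
                      double prev_rate, double cur_rate)
{
    assert(cur_threads >= 1 && max_threads >= cur_threads);

    /* No real improvement: the disk looks saturated, so hold and wait. */
    if (cur_rate < prev_rate * (1.0 + MIN_GAIN))
        return cur_threads;

    /* Still gaining: add one more thread, up to the cap. */
    if (cur_threads < max_threads)
        return cur_threads + 1;
    return cur_threads;
}
```

For example, going from 100 to 150 flushes/sec with 2 threads would ask for a
third thread, while going from 150 to 152 with 3 threads would stay at 3.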

> If the background writer uses fsync, it can write and allow the buffer
> to be reused and fsync later, while if we use O_SYNC, we have to wait
> for the O_SYNC write to happen before reusing the buffer; that will be
> slower.

Certainly. However, a file opened with O_SYNC would not require a separate
fsync. I suggested it only for WAL. But given the WAL block grouping suggested
in another post, using fsync for all files might be a good idea.
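
For reference, the two options look roughly like this at the system-call
level. This is only a sketch: BLOCK_SIZE, write_blocks_then_fsync and open_sync
are illustrative names, and it has nothing to do with what xlog.c actually
does.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size for this sketch */

/*
 * Grouped approach: write several blocks, then pay for one fsync.
 * Returns 0 on success, -1 on error (EINTR is not retried in this sketch).
 */
int write_blocks_then_fsync(int fd, const char *buf, int nblocks)
{
    for (int i = 0; i < nblocks; i++) {
        const char *p = buf + (size_t) i * BLOCK_SIZE;
        size_t left = BLOCK_SIZE;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n <= 0)
                return -1;
            p += n;
            left -= (size_t) n;
        }
    }
    return fsync(fd);   /* one flush covers all blocks written above */
}

/*
 * O_SYNC approach: every write() on the returned fd completes only once
 * the data has reached stable storage, so no separate fsync is needed,
 * but each write waits for the disk.
 */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}
```

With the first function, N blocks cost N plain writes plus one flush; with an
O_SYNC descriptor, each of the N writes waits for the disk on its own.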

Just a thought.

Shridhar
