From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Experimental patch for inter-page delay in VACUUM |
Date: | 2003-11-10 04:07:12 |
Message-ID: | 200311100407.hAA47Cp22890@candle.pha.pa.us |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> You want to find, open, and fsync() every file in the database cluster
> >> for every checkpoint? Sounds like a non-starter to me.
>
> > Except a) this is outside any critical path, and b) only done every few
> > minutes and c) the fsync calls on files with no dirty buffers ought to be
> > cheap, at least as far as i/o.
>
> The directory search and opening of the files is in itself nontrivial
> overhead ... particularly on systems where open(2) isn't speedy, such
> as Solaris. I also disbelieve your assumption that fsync'ing a file
> that doesn't need it will be free. That depends entirely on what sort
> of indexes the OS keeps on its buffer cache. There are Unixen where
> fsync requires a scan through the entire buffer cache because there is
> no data structure that permits finding associated buffers any more
> efficiently than that. (IIRC, the HPUX system I'm typing this on is
> like that.) On those sorts of systems, we'd be way better off to use
> O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
> Check the archives --- this was all gone into in great detail when we
> were testing alternative methods for fsyncing the WAL files.
Not sure on this one --- let's look at our options:

    O_SYNC
    fsync
    sync

Now, O_SYNC is going to force every write to the disk. If we have a
transaction that has to write lots of buffers (it has to write them out
to reuse the shared buffers), it will have to wait for every buffer to
hit disk before the write returns --- this seems terrible to me and
gives the drive no way to group adjacent writes. Even on HPUX, which has
poor fsync dirty buffer detection, if the fsync is outside the main
processing loop (checkpoint process), isn't fsync better than O_SYNC?
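
A minimal sketch of the two strategies compared above, assuming a POSIX
system. The function name write_blocks, the 8192-byte block size, and the
file path are illustrative assumptions, not PostgreSQL code. With O_SYNC
every write() waits for the disk; without it the writes land in kernel
buffers and a single fsync() at the end covers all of them.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size, for illustration only */

/*
 * Hypothetical helper: write nblocks blocks to path.
 * use_osync != 0: open with O_SYNC, so each write() returns only after
 *                 the data has been pushed to stable storage.
 * use_osync == 0: buffered writes, then one fsync() for the whole batch.
 * Returns 0 on success, -1 on any error.
 */
int write_blocks(const char *path, int nblocks, int use_osync)
{
    char block[BLOCK_SIZE];
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    int fd, i;

    if (use_osync)
        flags |= O_SYNC;
    memset(block, 'x', sizeof(block));

    fd = open(path, flags, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < nblocks; i++) {
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block)) {
            close(fd);
            return -1;
        }
    }

    /* Deferred path: the kernel may reorder and group the writes above. */
    if (!use_osync && fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Timing write_blocks(path, n, 1) against write_blocks(path, n, 0) on a
given platform is one way to see how large the per-write wait is there.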

Now, if we are sure that writes will happen only in the checkpoint
process, O_SYNC would be OK, I guess, but will we ever be sure of that?
I can't imagine a checkpoint process keeping up with lots of active
backends, especially if the writes use O_SYNC. The problem is that
instead of having the backends write everything to kernel buffers, we
are all of a sudden forcing all writes of dirty buffers to disk. sync()
starts to look very attractive compared to that option.

fsync is better in that we can force it after a number of writes, and
can delay it, so we can write a buffer and reuse it, then later issue
the fsync. That is a win, though it doesn't allow the drive to group
adjacent writes in different files. sync() of course allows grouping of
all writes by the drive, but writes all non-PostgreSQL dirty buffers
too. Ideally, we would have an fsync() where we could pass it a list of
our files and it would do all of them optimally.
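
No such list-taking fsync() exists in POSIX, so the closest portable
approximation is a loop. The sketch below uses a hypothetical helper
named fsync_file_list; it is only the naive open-and-fsync-each-file
version, with the per-file open() cost Tom objects to, and it gives the
kernel no chance to order writes across files.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync every file named in paths[0..npaths-1].
 * Each file is opened, synced and closed in turn, so the syncs happen
 * one file at a time. Returns the number of files that could not be
 * opened or synced (0 means all succeeded).
 */
int fsync_file_list(const char *const *paths, size_t npaths)
{
    int failures = 0;
    size_t i;

    for (i = 0; i < npaths; i++) {
        /* O_RDWR because some systems refuse fsync() on a read-only fd */
        int fd = open(paths[i], O_RDWR);

        if (fd < 0) {
            failures++;
            continue;
        }
        if (fsync(fd) != 0)
            failures++;
        close(fd);
    }
    return failures;
}
```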

From what I have heard so far, sync() still seems like the most
efficient method. I know it only schedules the writes, but with a sleep
after it, it seems like maybe the best bet.
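
As a rough sketch of the sync-then-sleep idea (the helper name
sync_and_wait and the settle time are assumptions for illustration):
POSIX allows sync() to return once the writes are scheduled, so the
sleep is only a guess at how long the flush takes, not a guarantee
that the data is on disk.

```c
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to flush all dirty buffers
 * (PostgreSQL's and everyone else's), then wait settle_secs seconds
 * in the hope that the scheduled writes have completed.
 * Returns the number of seconds left unslept if interrupted, else 0.
 */
unsigned int sync_and_wait(unsigned int settle_secs)
{
    sync();                      /* may return before the I/O is done */
    return sleep(settle_secs);   /* heuristic delay, not a completion check */
}
```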
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073