Re: We really ought to do something about O_DIRECT and data=journalled on ext4

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 01:34:33
Message-ID: 5875.1291685673@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> So my guess is that some small percentage of Windows users might notice
> a change here, and some testing on FreeBSD would be useful too. That's
> about it for platforms that I think anybody needs to worry about.

To my mind, O_DIRECT is not really the key issue here; it's whether to
prefer O_DSYNC or fdatasync. I looked back in the archives, and I think
that the main reason we prefer O_DSYNC when available is the results
I got here:

http://archives.postgresql.org/pgsql-hackers/2001-03/msg00381.php

which demonstrated a performance benefit on HPUX 10.20, though with a
test tool much more primitive than test_fsync. I still have that
machine, although the disk that was in it at the time died a while back.
What's in there now is a Seagate ST336607LW spinning at 10000 RPM (166
rev/sec), and today I get numbers like this from test_fsync:

Simple write:
    8k write                        28331.020/second

Compare file sync methods using one write:
    open_datasync 8k write            161.190/second
    open_sync 8k write                156.478/second
    8k write, fdatasync                54.302/second
    8k write, fsync                    51.810/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes          81.702/second
    2 open_sync 8k writes              80.172/second
    8k write, 8k write, fdatasync      40.829/second
    8k write, 8k write, fsync          39.836/second

Compare open_sync with different sizes:
    open_sync 16k write                80.192/second
    2 open_sync 8k writes              78.018/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close             52.527/second
    8k write, close, fsync             54.092/second
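
One way to read these numbers is to convert each rate into disk revolutions per completed operation, using the 10000 RPM figure quoted above. The short Python sketch below does only that arithmetic; the helper name `revs_per_op` is made up for illustration, and the interpretation (roughly one revolution per O_DSYNC write versus roughly three per fdatasync/fsync call on this machine) is an inference from the figures, not something test_fsync reports.

```python
# Rotation rate of the drive described above.
RPM = 10000
REV_PER_SEC = RPM / 60  # about 166.7 revolutions per second

# Rates copied from the "one write" section of the test_fsync output.
one_write_rates = {
    "open_datasync": 161.190,
    "open_sync": 156.478,
    "fdatasync": 54.302,
    "fsync": 51.810,
}


def revs_per_op(ops_per_sec):
    """Disk revolutions that elapse per completed operation (hypothetical helper)."""
    return REV_PER_SEC / ops_per_sec


if __name__ == "__main__":
    for name, rate in one_write_rates.items():
        print(f"{name:14s} {revs_per_op(rate):.2f} revolutions per operation")
```
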

So *on that rather ancient platform* there's a measurable performance
benefit to O_DSYNC, but this seems to be largely because fdatasync is
stubbed to fsync in userspace rather than because fdatasync wouldn't
be a better idea in the abstract. Also, a lot of the argument against
fsync at the time was that it forced the kernel to iterate through all
the buffers for the WAL file to see if any were dirty. I would imagine
that modern kernels are a tad smarter about that; and even if they
aren't, the CPU speed versus disk speed tradeoff has changed enough
since 2001 that iterating through 16MB of buffers isn't as interesting
as it was then.
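
For readers who have not looked at the two approaches side by side, here is a minimal Python sketch of the difference being discussed. It is not PostgreSQL's code (which is C); the function names and the 8k block are illustrative assumptions. The first function opens the file with O_DSYNC so every write is synchronous; the second writes normally and issues a single fdatasync at the end.

```python
import os
import tempfile

BLOCK = b"\0" * 8192  # one 8k block, matching the test_fsync write size


def write_open_datasync(path, nblocks):
    """Write nblocks blocks through a descriptor opened with O_DSYNC.

    Where the platform lacks O_DSYNC the flag falls back to 0, in which
    case this sketch performs plain unsynchronized writes.
    """
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(nblocks):
            os.write(fd, BLOCK)  # each write is synced before returning
    finally:
        os.close(fd)


def write_then_fdatasync(path, nblocks):
    """Write nblocks blocks normally, then flush them with one sync call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(nblocks):
            os.write(fd, BLOCK)
        # Fall back to fsync where fdatasync is not provided.
        sync = getattr(os, "fdatasync", os.fsync)
        sync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_open_datasync(os.path.join(tmp, "dsync.dat"), 2)
        write_then_fdatasync(os.path.join(tmp, "fdatasync.dat"), 2)
```

Note that with two writes the O_DSYNC path syncs twice while the fdatasync path syncs once, which is consistent with the "two writes" figures above, where the open_datasync rate roughly halves and the fdatasync rate drops by much less.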

So to my mind, switching to the preference order fdatasync,
fsync_writethrough, fsync seems like the thing to do. Since we assume
fsync is always available, that means that O_DSYNC/O_SYNC will not be
the defaults on any platform.
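
The proposed rule can be sketched as a simple first-match lookup. This is an illustration in Python rather than the actual PostgreSQL implementation; `PREFERENCE` and `default_sync_method` are hypothetical names, and the set of available methods would in practice come from platform detection at build time.

```python
# Proposed preference order from the paragraph above, most preferred first.
PREFERENCE = ("fdatasync", "fsync_writethrough", "fsync")


def default_sync_method(available):
    """Return the first method in PREFERENCE that the platform offers.

    `available` is any collection of method names; open_datasync and
    open_sync may appear in it but are never chosen as the default.
    """
    for method in PREFERENCE:
        if method in available:
            return method
    raise ValueError("fsync is assumed to be available everywhere")


if __name__ == "__main__":
    # A platform that offers O_DSYNC still ends up on fdatasync.
    print(default_sync_method({"open_datasync", "fdatasync", "fsync"}))
```
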

regards, tom lane
