Re: O_DIRECT for WAL writes

From:	Mary Edie Meredith <maryedie(at)osdl(dot)org>
To:	Neil Conway <neilc(at)samurai(dot)com>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: O_DIRECT for WAL writes
Date:	2005-06-02 18:49:28
Message-ID:	1117738168.2922.411.camel@localhost
Lists:	pgsql-hackers pgsql-patches

On Thu, 2005-06-02 at 11:39 +1000, Neil Conway wrote:
> On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> > I know I'm late to this discussion, and I haven't made it all the way
> > through this thread to see if your questions on Linux writes were
> > resolved. If you are still interested, I recommend reading a very good
> > one-page description of reliable writes buried in the Data Center Linux
> > Goals and Capabilities document.
>
> This suggests that on Linux a write() on a file opened with O_DIRECT has
> the same synchronization guarantees as a write() on a file opened with
> O_SYNC, which is precisely the opposite of what was concluded down
> thread. So now I'm more confused :)
>
> (Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
> this way on all platforms -- for example, FreeBSD's open(2) manpage does
> not mention I/O synchronization when referring to O_DIRECT. So even if
> we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
> do that on all platforms.)

My understanding is that O_DIRECT means "direct" as in "no buffering by
the OS", which implies that when you write from your buffer, the write
does not return until the OS thinks the write is complete (unless you
are using async IO). Otherwise you might reuse your buffer (there _is_
no other buffer, after all), and if the first write were still
incomplete when you refilled the buffer for the next one, the first
write might go out with the wrong data.
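
To make the blocking-write case concrete, here is a minimal sketch in Python, assuming a POSIX system. The helper name write_block_durably and the 4096-byte block size are illustrative assumptions, not anything from PostgreSQL or the patch under discussion. It tries O_DIRECT with a page-aligned buffer, falls back to O_SYNC where O_DIRECT is unavailable, and still calls fsync(), since the thread has not settled whether O_DIRECT alone is sufficient on every platform.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; real code should query the device block size


def write_block_durably(path, payload):
    """Write one zero-padded block; return the name of the open flag used."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")

    # O_DIRECT is not defined on every platform, so probe for it.
    modes = []
    if hasattr(os, "O_DIRECT"):
        modes.append(("O_DIRECT", os.O_DIRECT))
    modes.append(("O_SYNC", os.O_SYNC))

    # Anonymous mmap memory is page-aligned, which O_DIRECT requires.
    buf = mmap.mmap(-1, BLOCK)
    try:
        buf[:len(payload)] = payload
        for name, flag in modes:
            try:
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | flag, 0o600)
            except OSError:
                continue  # this filesystem rejects the flag; try the next
            try:
                # The call blocks; only after it returns may buf be refilled.
                if os.write(fd, buf) == BLOCK:
                    # Kept deliberately: do not assume O_DIRECT implies O_SYNC.
                    os.fsync(fd)
                    return name
            except OSError:
                pass  # e.g. alignment rejected; fall through to next mode
            finally:
                os.close(fd)
        raise OSError("no usable open mode for " + path)
    finally:
        buf.close()
```

Which flag ends up being used depends on the filesystem, so the caller should treat the returned name as informational only.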

Now if you want to avoid _waiting_ for the write to complete, you need to
employ async IO, which is why most databases that support direct IO for
their datafiles have also implemented some form of async IO
(either via OS calls or some built-in mechanism, as is the case with
SAP-DB). With AIO you have to manage your buffers so that you reuse them
only once you are notified that the IO has completed. Historically this
was done with raw datafiles, but currently (at least on Linux) you can
also do this with files. For logging, though, I think you want
synchronous IO to guarantee order.
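
The buffer-management rule can be sketched as follows. This is a simulation under stated assumptions: a thread pool stands in for kernel AIO, and BufferPool and write_all are hypothetical names rather than any real database's API. The point it illustrates is only that a buffer goes back on the free list from the completion callback, never earlier.

```python
import os
import queue
from concurrent.futures import ThreadPoolExecutor


class BufferPool:
    """Fixed set of buffers; a buffer is reusable only after release()."""

    def __init__(self, count, size):
        self.size = size
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def acquire(self):
        return self._free.get()  # blocks while every buffer is in flight

    def release(self, buf):
        self._free.put(buf)


def _pwrite_prefix(fd, buf, n, offset):
    # Write the first n bytes of buf at a fixed offset (POSIX pwrite).
    view = memoryview(buf)[:n]
    try:
        return os.pwrite(fd, view, offset)
    finally:
        view.release()


def write_all(path, chunks, pool, workers=2):
    """Queue each chunk as a background write, recycling pool buffers."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures, offset = [], 0
            for chunk in chunks:
                n = len(chunk)
                if n > pool.size:
                    raise ValueError("chunk larger than a pool buffer")
                buf = pool.acquire()
                buf[:n] = chunk
                fut = executor.submit(_pwrite_prefix, fd, buf, n, offset)
                # Completion notification: only now is the buffer free again.
                fut.add_done_callback(lambda _f, b=buf: pool.release(b))
                futures.append(fut)
                offset += n
            for fut in futures:
                fut.result()  # surface any write error
        os.fsync(fd)
    finally:
        os.close(fd)
```

Because each write carries its own offset, the file contents do not depend on completion order, but nothing here orders the writes in time, which is why a log writer would likely still want the synchronous path.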

The cool thing about buffering the datafile data yourself is that _you_
(the database engine) can control what stays in (shared) memory and what
does not. You can add configuration options or add intelligence so
that frequently used data (like hot indexes) can stay in memory
indefinitely. The OS can never do that so specifically. In addition,
you can keep data from table scans from overwriting hot objects. Of
course, at the moment you are discussing direct IO for logging, but
there should be benefits to extending it to datafiles as well, assuming
you also implement async IO.

Bottom line: if you do not implement direct/async IO so that you
optimize caching of hot database objects and minimize memory utilization
of objects used once, you are probably leaving performance on the table
for datafiles.

Daniel is on vacation, but I will ask him to confirm once he returns.
>
> -Neil
>
--
Mary Edie Meredith
maryedie(at)osdl(dot)org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs

In response to

Re: O_DIRECT for WAL writes at 2005-06-02 01:39:25 from Neil Conway

Responses

Re: O_DIRECT for WAL writes at 2005-06-03 00:37:39 from Neil Conway
