Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Layton <jlayton(at)redhat(dot)com>
Cc: Dave Chinner <david(at)fromorbit(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 13:57:25
Message-ID: CA+TgmoZ8Z=886ABP2rNDvPcG7S1P8OMrTs0ToMQ3N1O-jo1gfA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <jlayton(at)redhat(dot)com> wrote:
> So this says to me that the WAL is a place where DIO should really be
> reconsidered. It's mostly sequential writes that need to hit the disk
> ASAP, and you need to know that they have hit the disk before you can
> proceed with other operations.

Ironically enough, we actually *have* an option to use O_DIRECT here.
But it doesn't work well. See below.

> Also, is the WAL actually ever read under normal (non-recovery)
> conditions or is it write-only under normal operation? If it's seldom
> read, then using DIO for them also avoids some double buffering since
> they wouldn't go through pagecache.

This is the first problem: if replication is in use, then the WAL gets
read shortly after it gets written. Using O_DIRECT bypasses the
kernel cache for the writes, but then the reads stink. However, if
you configure wal_sync_method=open_sync and disable replication, then
you will in fact get O_DIRECT|O_SYNC behavior.

But that still doesn't work out very well, because now the guy who
does the write() has to wait for it to finish before he can do
anything else. That's not always what we want, because WAL gets
written out from our internal buffers for multiple different reasons.
If we're forcing the WAL out to disk because of transaction commit or
because we need to write the buffer protected by a certain WAL record
only after the WAL hits the platter, then it's fine. But sometimes
we're writing WAL just because we've run out of internal buffer space,
and we don't want to block waiting for the write to complete. Opening
the file with O_SYNC deprives us of the ability to control the timing
of the sync relative to the timing of the write.
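
To make the distinction concrete, here is a minimal sketch (not
PostgreSQL's actual code) of writing to a file opened without O_SYNC,
so that the caller decides per write whether to wait for the disk.
The names write_wal and must_be_durable are hypothetical and exist
only for illustration.

```python
import os
import tempfile

def write_wal(fd, record, must_be_durable):
    """Append a record; wait for the disk only when the caller needs durability.

    Hypothetical helper: with O_SYNC on the descriptor, every os.write()
    would block until the data is on stable storage, and this choice
    would not exist.
    """
    os.write(fd, record)  # small record to a regular file; returns once buffered
    if must_be_durable:
        os.fsync(fd)      # e.g. transaction commit: block until it is on disk
    return must_be_durable

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal_segment")
        # Note: no os.O_SYNC here, so write and sync are separate steps.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            # Buffer space ran out: hand the data to the kernel, do not wait.
            write_wal(fd, b"filler record\n", must_be_durable=False)
            # Commit: this one has to reach the disk before we proceed.
            write_wal(fd, b"commit record\n", must_be_durable=True)
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return f.read()
```

The single fsync at commit also covers the earlier unsynced write to
the same file, which is the flexibility that O_SYNC takes away.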

> Again, I think this discussion would really benefit from an outline of
> the different files used by pgsql, and what sort of data access
> patterns you expect with them.

I think I more or less did that in my previous email, but here it is
again in briefer form:

- WAL files are written (and sometimes read) sequentially and fsync'd
very frequently; it's always good to write the data out to disk as
soon as possible
- Temp files are written and read sequentially and never fsync'd.
They should only be written to disk when memory pressure demands it
(but are a good candidate when that situation comes up)
- Data files are read and written randomly. They are fsync'd at
checkpoint time; between checkpoints, it's best not to write them
sooner than necessary, but when the checkpoint arrives, they all need
to get out to the disk without bringing the system to a standstill

We have other kinds of files, but off-hand I'm not thinking of any
that are really very interesting, apart from those.

Maybe it'll be useful to have hints that say "always write this file
to disk as quickly as you can" and "always postpone writing this file to
disk for as long as you can" for WAL and temp files respectively. But
the rule for the data files, which are the really important case, is
not so simple. fsync() is actually a fine API except that it tends to
destroy system throughput. Maybe what we need is just for fsync() to
be less aggressive, or a less aggressive version of it. We wouldn't
mind waiting an almost arbitrarily long time for fsync to complete if
other processes could still get their I/O requests serviced in a
reasonable amount of time in the meanwhile.
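
As a rough user-space illustration of trading fsync latency for
fairness, the sketch below syncs a set of files one at a time and
pauses between them so other I/O gets a chance to run. This is only
an assumption about what "less aggressive" could look like from the
application side; it is neither a kernel facility nor a description
of what PostgreSQL does, and paced_fsync and pause_seconds are
hypothetical names.

```python
import os
import tempfile
import time

def paced_fsync(paths, pause_seconds):
    """fsync each file in turn, sleeping between files.

    Total time to finish grows with the pauses, which is acceptable
    if the goal is only that everything is durable eventually.
    """
    synced = []
    for i, path in enumerate(paths):
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)  # still blocks this caller until the file is on disk
        finally:
            os.close(fd)
        synced.append(path)
        if i < len(paths) - 1:
            time.sleep(pause_seconds)  # leave a gap for other processes' I/O
    return synced

def demo():
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for name in ("rel_a", "rel_b", "rel_c"):  # stand-ins for data files
            p = os.path.join(d, name)
            with open(p, "wb") as f:
                f.write(b"x" * 8192)
            paths.append(p)
        return [os.path.basename(p) for p in paced_fsync(paths, 0.01)]
```

Pacing between files does nothing about a single large file whose
fsync floods the I/O queue by itself, which is why help from the
kernel would still be needed.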

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
