From: | Jan Kara <jack(at)suse(dot)cz> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Jeff Layton <jlayton(at)redhat(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-21 20:20:52 |
Message-ID: | 20140121202052.GG21195@quack.suse.cz |
Lists: | pgsql-hackers |
On Fri 17-01-14 08:57:25, Robert Haas wrote:
> On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <jlayton(at)redhat(dot)com> wrote:
> > So this says to me that the WAL is a place where DIO should really be
> > reconsidered. It's mostly sequential writes that need to hit the disk
> > ASAP, and you need to know that they have hit the disk before you can
> > proceed with other operations.
>
> Ironically enough, we actually *have* an option to use O_DIRECT here.
> But it doesn't work well. See below.
>
> > Also, is the WAL actually ever read under normal (non-recovery)
> > conditions or is it write-only under normal operation? If it's seldom
> > read, then using DIO for them also avoids some double buffering since
> > they wouldn't go through pagecache.
>
> This is the first problem: if replication is in use, then the WAL gets
> read shortly after it gets written. Using O_DIRECT bypasses the
> kernel cache for the writes, but then the reads stink.

OK, yes, this is hard to fix with direct IO.

> However, if you configure wal_sync_method=open_sync and disable
> replication, then you will in fact get O_DIRECT|O_SYNC behavior.
>
> But that still doesn't work out very well, because now the guy who
> does the write() has to wait for it to finish before he can do
> anything else. That's not always what we want, because WAL gets
> written out from our internal buffers for multiple different reasons.

Well, you can always use AIO (io_submit()) to submit direct IO without
waiting for it to finish. But then you might need to track the outstanding
IO so that you can check with io_getevents() when it has finished.

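To make the submit-then-reap pattern concrete, here is a minimal sketch using the raw Linux native AIO syscalls. Only io_submit() and io_getevents() come from the text above; the helper names, the 4096-byte block size, the queue depth of 8, and the fallback to buffered IO when the filesystem rejects O_DIRECT are all assumptions made for illustration, not how PostgreSQL writes WAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0  /* not exposed by these headers: sketch degrades to buffered IO */
#endif
#define BLOCK 4096  /* assumed size/alignment; O_DIRECT needs aligned buffers and offsets */

/* glibc has no wrappers for native AIO, so call the syscalls directly. */
static long sys_io_setup(long nr, aio_context_t *ctx)
{ return syscall(SYS_io_setup, nr, ctx); }
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **cbs)
{ return syscall(SYS_io_submit, ctx, n, cbs); }
static long sys_io_getevents(aio_context_t ctx, long min, long max, struct io_event *ev)
{ return syscall(SYS_io_getevents, ctx, min, max, ev, NULL); }

/* Queue one block write and return at once; cb and buf must stay valid
 * until the completion has been reaped. */
static int submit_block(aio_context_t ctx, struct iocb *cb, int fd, void *buf, off_t off)
{
    struct iocb *list[1] = { cb };

    assert(fd >= 0);
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = (uint32_t) fd;
    cb->aio_lio_opcode = IOCB_CMD_PWRITE;
    cb->aio_buf = (uint64_t) (uintptr_t) buf;
    cb->aio_nbytes = BLOCK;
    cb->aio_offset = off;
    cb->aio_data = (uint64_t) (uintptr_t) cb;  /* cookie echoed back in the event */
    return sys_io_submit(ctx, 1, list) == 1 ? 0 : -1;
}

/* Wait for one completion; returns bytes written, or a negative value on error. */
static long wait_one(aio_context_t ctx)
{
    struct io_event ev;
    long n;

    do {
        n = sys_io_getevents(ctx, 1, 1, &ev);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? (long) ev.res : -1;
}

/* Hypothetical demo: write one block to `path`, then remove the file. */
static long aio_demo(const char *path)
{
    aio_context_t ctx = 0;  /* must be zero before io_setup() */
    struct iocb cb;
    void *buf = NULL;
    long written = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);

    if (fd < 0 && errno == EINVAL)  /* filesystem without O_DIRECT support */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, BLOCK, BLOCK) == 0 && sys_io_setup(8, &ctx) == 0) {
        memset(buf, 'w', BLOCK);
        if (submit_block(ctx, &cb, fd, buf, 0) == 0) {
            /* other work could run here while the write is in flight */
            written = wait_one(ctx);
        }
        syscall(SYS_io_destroy, ctx);
    }
    free(buf);
    close(fd);
    unlink(path);
    return written;
}
```

Calling aio_demo() with a scratch path should return BLOCK on success. Note that completion of an O_DIRECT write does not by itself make the data durable; O_DSYNC or a later fdatasync() would still be needed for that.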
> If we're forcing the WAL out to disk because of transaction commit or
> because we need to write the buffer protected by a certain WAL record
> only after the WAL hits the platter, then it's fine. But sometimes
> we're writing WAL just because we've run out of internal buffer space,
> and we don't want to block waiting for the write to complete. Opening
> the file with O_SYNC deprives us of the ability to control the timing
> of the sync relative to the timing of the write.

O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
transaction commit whenever any metadata has changed on the filesystem.
Since the mtime & ctime of files change often, this will be the case very
often.

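For comparison, a rough sketch of the alternative to O_SYNC: open the file without any sync flag, write whenever buffer space runs out, and call fdatasync() only at the points where durability is required. fdatasync() does not flush metadata that is not needed to read the data back (mtime, for instance), which is what makes it relevant to the penalty described above. The function names, record contents and file layout are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer at `off`, retrying on short writes and EINTR.
 * Nothing here forces the data to stable storage. Returns 0 or -1. */
static int wal_write(int fd, const char *buf, size_t len, off_t off)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        buf += n;
        off += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Called only when durability is required, e.g. at transaction commit. */
static int wal_flush(int fd)
{
    return fdatasync(fd);
}

/* Hypothetical demo: two "out of buffer space" writes, then one flush. */
static int wal_demo(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);  /* no O_SYNC */
    int rc;
    off_t size;

    if (fd < 0)
        return -1;
    rc = wal_write(fd, "record-1", 8, 0);
    if (rc == 0)
        rc = wal_write(fd, "record-2", 8, 8);
    if (rc == 0)
        rc = wal_flush(fd);  /* the single point where we wait for the disk */
    size = lseek(fd, 0, SEEK_END);
    close(fd);
    unlink(path);
    return (rc == 0 && size == 16) ? 0 : -1;
}
```

The writer that only ran out of buffer space returns as soon as wal_write() has copied the data into the page cache; only the caller that needs the WAL on disk pays for wal_flush().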
> > Again, I think this discussion would really benefit from an outline of
> > the different files used by pgsql, and what sort of data access
> > patterns you expect with them.
>
> I think I more or less did that in my previous email, but here it is
> again in briefer form:
>
> - WAL files are written (and sometimes read) sequentially and fsync'd
> very frequently and it's always good to write the data out to disk as
> soon as possible
> - Temp files are written and read sequentially and never fsync'd.
> They should only be written to disk when memory pressure demands it
> (but are a good candidate when that situation comes up)
> - Data files are read and written randomly. They are fsync'd at
> checkpoint time; between checkpoints, it's best not to write them
> sooner than necessary, but when the checkpoint arrives, they all need
> to get out to the disk without bringing the system to a standstill
>
> We have other kinds of files, but off-hand I'm not thinking of any
> that are really very interesting, apart from those.
>
> Maybe it'll be useful to have hints that say "always write this file
> to disk as quick as you can" and "always postpone writing this file to
> disk for as long as you can" for WAL and temp files respectively. But
> the rule for the data files, which are the really important case, is
> not so simple. fsync() is actually a fine API except that it tends to
> destroy system throughput. Maybe what we need is just for fsync() to
> be less aggressive, or a less aggressive version of it. We wouldn't
> mind waiting an almost arbitrarily long time for fsync to complete if
> other processes could still get their I/O requests serviced in a
> reasonable amount of time in the meanwhile.

As I wrote in another email in this thread, using IO priorities for data
file checkpoints might actually be the right answer. They will work for IO
submitted by fsync(). The downside is that currently IO priorities / IO
scheduling classes work only with the CFQ IO scheduler.

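As a sketch of what that could look like from the application side: the checkpointing thread lowers its own IO priority around the fsync() call and restores it afterwards. glibc has no wrappers for ioprio_get()/ioprio_set(), so the constants are defined locally from the kernel ABI. The choice of the lowest best-effort level (rather than the idle class), the helper names, and the policy of still running fsync() when the priority change is refused are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Kernel ioprio ABI, defined locally because glibc ships no wrappers. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(cls, level) (((cls) << IOPRIO_CLASS_SHIFT) | (level))
#define IOPRIO_PRIO_CLASS(v) (((v) >> IOPRIO_CLASS_SHIFT) & 0x7)
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
#define IOPRIO_WHO_PROCESS 1

/* who == 0 with IOPRIO_WHO_PROCESS addresses the calling thread. */
static int ioprio_get_self(void)
{
    return (int) syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
}

static int ioprio_set_self(int prio)
{
    return (int) syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio);
}

/* fsync() with the caller temporarily at the lowest best-effort priority.
 * The priority change is treated as a hint: if it is refused, fsync() still
 * runs, and the fsync() result is what gets returned. */
static int fsync_low_prio(int fd)
{
    int old, lowered, rc;

    assert(fd >= 0);
    old = ioprio_get_self();
    lowered = ioprio_set_self(IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)) == 0;
    rc = fsync(fd);

    if (lowered && old >= 0) {
        /* A priority that was never set reads back as class NONE;
         * the kernel only accepts NONE with level 0 when setting. */
        if (IOPRIO_PRIO_CLASS(old) == IOPRIO_CLASS_NONE)
            old = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
        ioprio_set_self(old);
    }
    return rc;
}

/* Hypothetical demo: dirty a scratch file, then sync it at low priority. */
static int ioprio_demo(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = (write(fd, "page", 4) == 4) ? fsync_low_prio(fd) : -1;
    close(fd);
    unlink(path);
    return rc;
}
```

Whether the lowered priority has any effect depends on the IO scheduler in use, as noted above.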
Honza
--
Jan Kara <jack(at)suse(dot)cz>
SUSE Labs, CR