From: Jan Kara <jack(at)suse(dot)cz>
To: Dave Chinner <david(at)fromorbit(dot)com>
Cc: Jan Kara <jack(at)suse(dot)cz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Layton <jlayton(at)redhat(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-21 23:49:34
Message-ID: 20140121234934.GI21195@quack.suse.cz
Lists: pgsql-hackers

On Wed 22-01-14 09:07:19, Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
> > > If we're forcing the WAL out to disk because of transaction commit or
> > > because we need to write the buffer protected by a certain WAL record
> > > only after the WAL hits the platter, then it's fine. But sometimes
> > > we're writing WAL just because we've run out of internal buffer space,
> > > and we don't want to block waiting for the write to complete. Opening
> > > the file with O_SYNC deprives us of the ability to control the timing
> > > of the sync relative to the timing of the write.
> > O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
> > transaction commit whenever any metadata on the filesystem has changed.
> > Since the mtime & ctime of files change often, that will be the case
> > very often.
>
> Therefore: O_DATASYNC.
O_DSYNC to be exact.
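
For illustration only (this sketch is mine, not part of the original
exchange): opening a WAL-like file with O_DSYNC, so that each write()
waits for the data and for the metadata needed to read it back, but not
for mtime/ctime-only inode updates the way O_SYNC does. The file name
and record size are invented.

  /*
   * Hypothetical sketch: a WAL-style append with O_DSYNC.  Each write()
   * returns only once the data has reached stable storage; unlike
   * O_SYNC, timestamp-only inode updates are not forced out.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char record[8192];
      int  fd;

      memset(record, 0, sizeof(record));
      fd = open("wal-segment", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC,
                0600);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* No separate fdatasync() is needed afterwards, but the caller
       * also loses control over when (and whether to batch) the flush. */
      if (write(fd, record, sizeof(record)) != (ssize_t) sizeof(record)) {
          perror("write");
          return 1;
      }
      close(fd);
      return 0;
  }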

> > > Maybe it'll be useful to have hints that say "always write this file
> > > to disk as quick as you can" and "always postpone writing this file to
> > > disk for as long as you can" for WAL and temp files respectively. But
> > > the rule for the data files, which are the really important case, is
> > > not so simple. fsync() is actually a fine API except that it tends to
> > > destroy system throughput. Maybe what we need is just for fsync() to
> > > be less aggressive, or a less aggressive version of it. We wouldn't
> > > mind waiting an almost arbitrarily long time for fsync to complete if
> > > other processes could still get their I/O requests serviced in a
> > > reasonable amount of time in the meanwhile.
> > As I wrote in some other email in this thread, using IO priorities for
> > data file checkpoints might actually be the right answer. They will work
> > for IO submitted by fsync(). The downside is that currently IO priorities
> > / IO scheduling classes work only with the CFQ IO scheduler.
>
> And I don't see it being implemented anywhere else because it's the
> priority aware scheduling infrastructure in CFQ that causes all the
> problems with IO concurrency and scalability...
So CFQ has all sorts of problems, but I never had the impression that
priority-aware scheduling is the culprit. It is all just complex: sync
idling, seeky writer detection, cooperating threads detection; sometimes
even the sync vs async distinction isn't exactly what one would want. And
I'm not speaking about the cgroup stuff... So it doesn't seem to me that
some other IO scheduler couldn't implement something like IO scheduling
classes reasonably efficiently.
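
For illustration only (again a sketch of mine, not from the thread):
putting the calling process into the idle IO scheduling class via the
raw ioprio_set() syscall (glibc has no wrapper), with constants taken
from linux/ioprio.h. As said above, only CFQ currently honours these
classes.

  /*
   * Hypothetical sketch: drop the calling process into the idle IO
   * class before doing checkpoint-style writes and fsync(), so the
   * resulting IO is scheduled behind everything else under CFQ.
   */
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define IOPRIO_CLASS_SHIFT  13
  #define IOPRIO_CLASS_IDLE   3
  #define IOPRIO_WHO_PROCESS  1
  #define IOPRIO_PRIO_VALUE(cls, data) \
          (((cls) << IOPRIO_CLASS_SHIFT) | (data))

  int main(void)
  {
      /* pid 0 means "the calling process". */
      if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                  IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0) {
          perror("ioprio_set");
          return 1;
      }
      /* ... write out dirty data files and fsync() them here; with CFQ
       * the IO submitted on their behalf runs at idle priority. */
      return 0;
  }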

Honza
--
Jan Kara <jack(at)suse(dot)cz>
SUSE Labs, CR
