From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Jan Kara <jack(at)suse(dot)cz> |
Cc: | Jeff Layton <jlayton(at)redhat(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-22 15:48:10 |
Message-ID: | CA+TgmoYZm5ExTufMfrHTJJ=h8w6WAndTvgVs3QMRpzMbufB6hw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara <jack(at)suse(dot)cz> wrote:
>> But that still doesn't work out very well, because now the guy who
>> does the write() has to wait for it to finish before he can do
>> anything else. That's not always what we want, because WAL gets
>> written out from our internal buffers for multiple different reasons.
> Well, you can always use AIO (io_submit) to submit direct IO without
> waiting for it to finish. But then you might need to track the outstanding
> IO so that you can watch with io_getevents() when it is finished.
Yeah. That wouldn't work well for us; the process that did the
io_submit() would want to move on to other things, and how would it,
or any other process, know that the I/O had completed?
> As I wrote in some other email in this thread, using IO priorities for
> data file checkpoint might be actually the right answer. They will work for
> IO submitted by fsync(). The downside is that currently IO priorities / IO
> scheduling classes work only with CFQ IO scheduler.
IMHO, the problem is simpler than that: no single process should be
allowed to completely screw over every other process on the system.
When the checkpointer process starts calling fsync(), the system
begins writing out the data that needs to be fsync()'d so aggressively
that service times for I/O requests from other processes go through the
roof. It's difficult for me to imagine that any application on any
I/O scheduler is ever happy with that behavior. We shouldn't need to
sprinkle our fsync() calls with special magic juju sauce that says
"hey, when you do this, could you try to avoid causing the rest of the
system to COMPLETELY GRIND TO A HALT?". That should be the *default*
behavior, if not the *only* behavior.
Now, that is not to say that we're unwilling to sprinkle magic juju
sauce if that's what it takes to solve this problem. If calling
fadvise() or sync_file_range() or some new API that you invent at some
point prior to calling fsync() helps the kernel do the right thing,
we're willing to do that. Or if you/the Linux community wants to
invent a new API fsync_but_do_not_crush_system() and have us call that
instead of the regular fsync(), we're willing to do that, too. But I
think there's an excellent case to be made, at least as far as
checkpoint I/O spikes are concerned, that the API is just fine as it
is and Linux's implementation is simply naive. We'd be perfectly
happy to wait longer for fsync() to complete in exchange for not
starving the rest of the system - and really, who wouldn't? Linux is
a multi-user system, and apportioning resources among multiple tasks
is a basic function of a multi-user kernel.
</rant>
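For concreteness, here is a rough sketch of what "calling sync_file_range() at some point prior to calling fsync()" might look like from the application side. It is only an illustration, not PostgreSQL code: the helper name write_paced() and its chunk parameter are made up for this example, and whether nudging writeback this way actually smooths out the I/O spike depends on the kernel and the I/O scheduler. sync_file_range() is Linux-specific and is treated here as a pure hint, so its return value is deliberately ignored.

```c
#define _GNU_SOURCE             /* needed for sync_file_range() on glibc */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to fd starting at file
 * offset base, in pieces of at most chunk bytes.  After each piece, ask
 * the kernel to start writeback of the range just written without
 * waiting for it, so that the final fsync() has less dirty data left to
 * push out in one burst.  Returns 0 on success, -1 on error.
 * (A sketch: EINTR handling and error reporting are omitted.)
 */
int write_paced(int fd, const char *buf, size_t len, off_t base, size_t chunk)
{
    size_t done = 0;

    if (chunk == 0)
        return -1;

    while (done < len) {
        size_t  n = (len - done < chunk) ? len - done : chunk;
        ssize_t w = pwrite(fd, buf + done, n, base + (off_t) done);

        if (w <= 0)
            return -1;

        /* Advisory only: initiate writeback for this range, do not wait. */
        (void) sync_file_range(fd, base + (off_t) done, (off_t) w,
                               SYNC_FILE_RANGE_WRITE);

        done += (size_t) w;
    }

    /* Durability still comes from fsync(); the hints above do not replace it. */
    return fsync(fd);
}
```

A caller would use it in place of a plain write-then-fsync sequence, e.g. write_paced(fd, page_data, nbytes, 0, 256 * 1024), where the 256 kB chunk size is an arbitrary value chosen for the example.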
Anyway, if CFQ or any other Linux I/O scheduler gets an option to
lower the priority of the fsyncs, I'm sure somebody here will test it
out and see whether it solves this problem. AFAICT, experiments to
date have pretty much universally shown CFQ to be worse than not-CFQ
and everything else to be more or less equivalent - but if that
changes, I'm sure many PostgreSQL DBAs will be more than happy to flip
CFQ back on.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company