From: Jan Kara <jack(at)suse(dot)cz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Mel Gorman <mgorman(at)suse(dot)de>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 13:07:44
Message-ID: 20140114130744.GD21327@quack.suse.cz
Lists: pgsql-hackers
On Tue 14-01-14 11:11:28, Heikki Linnakangas wrote:
> On 01/14/2014 12:26 AM, Mel Gorman wrote:
> >On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> >>The other thing that comes to mind is the kernel's caching behavior.
> >>We've talked a lot over the years about the difficulties of getting
> >>the kernel to write data out when we want it to and to not write data
> >>out when we don't want it to.
> >
> >Is sync_file_range() broken?
> >
> >>When it writes data back to disk too
> >>aggressively, we get lousy throughput because the same page can get
> >>written more than once when caching it for longer would have allowed
> >>write-combining.
> >
> >Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> >If it's dirty_writeback_centisecs then that would be particularly tricky
> >because poor interactions there would come down to luck basically.
>
> >>When it doesn't write data to disk aggressively
> >>enough, we get huge latency spikes at checkpoint time when we call
> >>fsync() and the kernel says "uh, what? you wanted that data *on the
> >>disk*? sorry boss!" and then proceeds to destroy the world by starving
> >>the rest of the system for I/O for many seconds or minutes at a time.
> >
> >Ok, parts of that are somewhat expected. It *may* depend on the
> >underlying filesystem. Some of them handle fsync better than others. If
> >you are syncing the whole file though when you call fsync then you are
> >potentially burned by having to writeback dirty_ratio amounts of memory
> >which could take a substantial amount of time.
> >
> >>We've made some desultory attempts to use sync_file_range() to improve
> >>things here, but I'm not sure that's really the right tool, and if it
> >>is we don't know how to use it well enough to obtain consistent
> >>positive results.
> >
> >That implies that either sync_file_range() is broken in some fashion we
> >(or at least I) are not aware of and that needs kicking.
>
> Let me try to explain the problem: Checkpoints can cause an I/O
> spike, which slows down other processes.
>
> When it's time to perform a checkpoint, PostgreSQL will write() all
> dirty buffers from the PostgreSQL buffer cache, and finally perform
> an fsync() to flush the writes to disk. After that, we know the data
> is safely on disk.
>
> In older PostgreSQL versions, the write() calls would cause an I/O
> storm as the OS cache quickly fills up with dirty pages, up to
> dirty_ratio, and after that all subsequent write()s block. That's OK
> as far as the checkpoint is concerned, but it significantly slows
> down queries running at the same time. Even a read-only query often
> needs to write(), to evict a dirty page from the buffer cache to
> make room for a different page. We made that less painful by adding
> sleeps between the write() calls, so that they are trickled over a
> long period of time and hopefully stay below dirty_ratio at all
> times.
Hum, I wonder whether you see any difference with reasonably recent
kernels (say newer than 3.2), because those have IO-less dirty throttling.
That means that:
a) the checkpointing thread (or other threads blocked due to the dirty
limit) won't issue IO on its own but rather waits for the flusher thread to
do the work.
b) there should be a more noticeable difference between the delay imposed
on a heavily dirtying thread (i.e. the checkpointing thread) and the delay
imposed on lightly dirtying threads (which is what I would expect from
threads doing an occasional page eviction to make room for another page).
> However, we still have to perform the fsync()s after the
> writes(), and sometimes that still causes a similar I/O storm.
Because there is still quite some dirty data in the page cache or because
e.g. ext3 has to flush a lot of unrelated dirty data?
> The checkpointer is not in a hurry. A checkpoint typically has 10-30
> minutes to finish, before it's time to start the next checkpoint,
> and even if it misses that deadline that's not too serious either.
> But the OS doesn't know that, and we have no way of telling it.
>
> As a quick fix, some sort of a lazy fsync() call would be nice. It
> would behave just like fsync() but it would not change the I/O
> scheduling at all. Instead, it would sleep until all the pages have
> been flushed to disk, at the speed they would've been without the
> fsync() call.
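[Editorial sketch] No such lazy fsync() exists in the kernel, but the desired behavior can be roughly approximated from userspace: kick off asynchronous writeback, sleep through most of the checkpoint interval so the flusher proceeds at its own pace, and only do the real fsync() near the deadline. The grace period here is an assumed tunable, not a PostgreSQL setting:

```c
/* Userspace approximation of the "lazy fsync" wished for above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int lazy_fsync(int fd, unsigned grace_seconds)
{
    /* Start asynchronous writeback of all dirty pages of fd
     * (offset 0, nbytes 0 means "to end of file"); do not wait. */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    /* Give the flusher thread time to write at its own pace. */
    sleep(grace_seconds);
    /* Guarantee durability; ideally little is left to flush by now. */
    return fsync(fd);
}
```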
>
> Another approach would be to give the I/O that the checkpointer
> process initiates a lower priority. This would be slightly
> preferable, because PostgreSQL could then issue the writes() as fast
> as it can, and have the checkpoint finish earlier when there's not
> much other load. Last I looked into this (which was a long time
> ago), there was no suitable priority system for writes, only reads.
Well, IO priority works for writes in principle; the trouble is it
doesn't work for writes which end up just in the page cache. Writeback of
the page cache is usually done by the flusher thread, so it is completely
disconnected from whoever created the dirty data. (Now I know this is dumb
and long term we want to do something about it so that IO cgroups work
reasonably reliably, but it is a tough problem: lots of complexity for not
so great gain...)
However, if you really issue the IO from a thread with low priority, it
will have low priority. So specifically, if you call fsync() from a thread
with low IO priority, the flushing done by fsync() will have that low
IO priority.
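[Editorial sketch] Concretely, that could look like the following, using the raw ioprio_set syscall (glibc provides no wrapper; the constants are copied from linux/ioprio.h). Real code would restore the previous priority afterwards:

```c
/* Flush with low IO priority: put the calling thread in the idle IO
 * class before fsync(), so the flushing IO it issues is tagged with
 * that priority. */
#include <sys/syscall.h>
#include <unistd.h>

/* From linux/ioprio.h; glibc exposes no ioprio_set() wrapper. */
#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(class, data) \
        (((class) << IOPRIO_CLASS_SHIFT) | (data))

int fsync_idle_prio(int fd)
{
    /* "which" = 0 means the calling thread. */
    syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
            IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    return fsync(fd);
}
```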
Similarly, if you called sync_file_range() once in a while from a thread
with low IO priority, the flushing IO would have low IO priority. But I
would be really careful about periodic sync_file_range() calls: they have
the potential of mixing with writeback from the flusher thread, and mixing
these two on different parts of a file can lead to bad IO patterns...
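[Editorial sketch] For reference, the periodic sync_file_range() pattern under discussion looks roughly like this (the 8 MB window is an illustrative assumption); the caveat above applies, since these explicit flushes can interleave badly with the flusher thread's own writeback:

```c
/* Periodic sync_file_range() pattern: after every few megabytes
 * written, ask the kernel to start writeback of just that range, so
 * dirty data does not pile up until the final fsync(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

#define FLUSH_WINDOW (8u << 20)   /* 8 MB, illustrative */

/* Call after writing nbytes at offset; if that write completed a
 * window, start asynchronous writeback of it without waiting. */
int flush_if_window_done(int fd, off_t offset, size_t nbytes)
{
    off_t end = offset + (off_t)nbytes;
    if (end % FLUSH_WINDOW != 0)
        return 0;                 /* window not complete yet */
    return sync_file_range(fd, end - FLUSH_WINDOW, FLUSH_WINDOW,
                           SYNC_FILE_RANGE_WRITE);
}
```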
Honza
--
Jan Kara <jack(at)suse(dot)cz>
SUSE Labs, CR