From: Mel Gorman <mgorman(at)suse(dot)de>
To: Dave Chinner <david(at)fromorbit(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)hansenpartnership(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-15 09:44:21
Message-ID: 20140115094421.GF4963@suse.de
Lists: pgsql-hackers
(This thread is now massive and I have not read it all yet. If anything
I say has already been discussed then whoops)
On Tue, Jan 14, 2014 at 12:09:46PM +1100, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> > On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > For one, postgres doesn't use mmap for files (and can't without major
> > > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > > horrible consequences for performance/scalability - very quickly you
> > > contend on locks in the kernel.
> >
> > I may as well dump this in this thread. We've discussed this in person
> > a few times, including at least once with Ted T'so when he visited
> > Dublin last year.
> >
> > The fundamental conflict is that the kernel understands better the
> > hardware and other software using the same resources, Postgres
> > understands better its own access patterns. We need to either add
> > interfaces so Postgres can teach the kernel what it needs about its
> > access patterns or add interfaces so Postgres can find out what it
> > needs to know about the hardware context.
>
> In my experience applications don't need to know anything about the
> underlying storage hardware - all they need is for someone to
> tell them the optimal IO size and alignment to use.
>
That potentially misses details of efficient IO patterns. For example, an
application might submit many small requests, each of the optimal IO size
and alignment, that are nonetheless sub-optimal overall. While these still
go through the underlying block layers, there is no guarantee that the
requests will arrive close enough together for efficient merging to occur.
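As an illustration only (the 8kB unit and buffer count below are assumptions
for the example, not something from this thread), an application that knows
the optimal IO size can still help by submitting adjacent blocks as a single
request rather than relying on the block layer to merge many small ones:

#define _DEFAULT_SOURCE
#include <sys/uio.h>
#include <unistd.h>

#define BLKSZ 8192	/* assumed "optimal" IO unit for the example */
#define NBUFS 16

/*
 * Write NBUFS adjacent BLKSZ buffers as one pwritev() request instead of
 * issuing NBUFS separate writes and hoping they merge in time.
 */
static ssize_t flush_run(int fd, off_t start, char bufs[NBUFS][BLKSZ])
{
	struct iovec iov[NBUFS];
	int i;

	for (i = 0; i < NBUFS; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BLKSZ;
	}

	return pwritev(fd, iov, NBUFS, start);
}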
> > The more ambitious and interesting direction is to let Postgres tell
> > the kernel what it needs to know to manage everything. To do that we
> > would need the ability to control when pages are flushed out. This is
> > absolutely necessary to maintain consistency. Postgres would need to
> > be able to mark pages as unflushable until some point in time in the
> > future when the journal is flushed. We discussed various ways that
> > interface could work but it would be tricky to keep it low enough
> > overhead to be workable.
>
> IMO, the concept of allowing userspace to pin dirty page cache
> pages in memory is just asking for trouble. Apart from the obvious
> memory reclaim and OOM issues, some filesystems won't be able to
> move their journals forward until the data is flushed. i.e. ordered
> mode data writeback on ext3 will have all sorts of deadlock issues
> that result from pinning pages and then issuing fsync() on another
> file which will block waiting for the pinned pages to be flushed.
>
That applies only if the dirty pages are forced to be kept dirty. You call
this pinning, but "pinned" already has a special meaning, so I would suggest
calling them something like dirty-sticky pages. Such hinting could have the
pages excluded from background dirty writeback while still allowing them to
be cleaned if the dirty limits are hit or fsync is called. It's a hint,
not a forced guarantee.
It's still a hand grenade if this is tracked on a per-page basis, because
what happens if the process crashes? Those pages potentially stay dirty
forever. An alternative would be to track this on a per-inode instead of a
per-page basis. The hint would only exist while there is an open fd for
that inode. Treat it as a privileged call, with a sysctl controlling how
many dirty-sticky pages can exist in the system and the information
reported during OOM kills, and maybe it starts becoming a bit more
manageable. Dirty-sticky pages are not guaranteed to stay dirty until
userspace acts; the kernel just stays away until there are no other
sensible options.
> Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
> If fsync() blocks because there are pinned pages, and there's no
> other thread to unpin them, then that code just deadlocked.
Indeed. Forcing pages with this hint to stay dirty until user space decides
to clean them is eventually going to blow up.
> <SNIP>
> Hmmmm. What happens if the process crashes after pinning the dirty
> pages? How do we even know what process pinned the dirty pages so
> we can clean up after it? What happens if the same page is pinned by
> multiple processes? What happens on truncate/hole punch if the
> partial pages in the range that need to be zeroed and written are
> pinned? What happens if we do direct IO to a range with pinned,
> unflushable pages in the page cache?
>
Proposal: A process with an open fd can hint that dirty pages belonging to
          this inode should be treated as dirty-sticky. Such pages will be
          ignored by background writeback unless there is an fsync call or
          the dirty page limits are hit. The hint is cleared when no
          process has the file open.
If the process crashes, the hint is cleared and the pages get cleaned as
normal.
Multiple processes do not matter as such because all of them must have the
file open. There is a problem if the processes disagree on whether the pages
should be dirty-sticky or not. The default would be that a dirty-sticky hint
takes priority, although it does mean that a potentially unprivileged
process can cause problems. There are security concerns here that would
have to be taken into account.
fsync and truncate both override the hint. fsync will write the pages,
truncate will discard them.
If there is direct IO on the range then force the sync, invalidate the
page cache, initiate the direct IO as normal.
At least one major downside is that the performance will depend on system
parameters and be non-deterministic, particularly in comparison to direct IO.
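To make the proposal concrete, a purely hypothetical sketch of the userspace
side follows. POSIX_FADV_DIRTY_STICKY and the vm.dirty_sticky_ratio sysctl
mentioned in the comment do not exist anywhere; the names are invented here
for illustration only.

#include <fcntl.h>

/* Hypothetical advice value -- not part of any kernel or libc. */
#define POSIX_FADV_DIRTY_STICKY 16

static int mark_dirty_sticky(int fd)
{
	/*
	 * Hint that dirty pages of this inode be skipped by background
	 * writeback while at least one fd is open.  fsync, truncate and
	 * hitting the dirty limits would still override the hint, and a
	 * crash (closing the last fd) clears it, so the pages can never
	 * become unreclaimable.  A sysctl such as vm.dirty_sticky_ratio
	 * (also hypothetical) would cap how many such pages are allowed.
	 */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DIRTY_STICKY);
}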
> These are all complex corner cases that are introduced by allowing
> applications to pin dirty pages in memory. I've only spent a few
> minutes coming up with these, and I'm sure there's more of them.
> As such, I just don't see that allowing userspace to pin dirty
> page cache pages in memory being a workable solution.
>
From what I've read so far, I'm not convinced they are looking for a
hard *pin* as such. They want better control over the how and the when
of writeback, not absolute control. I somewhat sympathise with their
reluctance to use direct IO when the kernel should be able to get them most,
if not all, of the potential performance.
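For comparison, a minimal sketch of the partial writeback control that is
already available on Linux today via sync_file_range() and posix_fadvise();
it controls the "when" without pinning anything, which may or may not be
enough for their write patterns:

#define _GNU_SOURCE
#include <fcntl.h>

/* Start asynchronous writeback of a dirty range now, without waiting. */
static void start_writeback(int fd, off_t off, off_t len)
{
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Later: wait for that writeback to complete and drop the clean pages. */
static void finish_and_drop(int fd, off_t off, off_t len)
{
	sync_file_range(fd, off, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}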
--
Mel Gorman
SUSE Labs