Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Dave Chinner <david(at)fromorbit(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)hansenpartnership(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 01:09:46
Message-ID: 20140114010946.GA3431@dastard
Lists: pgsql-hackers

On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
>
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted Ts'o when he visited
> Dublin last year.
>
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, while Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.

In my experience applications don't need to know anything about the
underlying storage hardware - all they need is for someone to
tell them the optimal IO size and alignment to use.

> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. e.g. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file, which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it's some other application they don't even know about that has
pinned the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees....
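
To make the single-threaded case concrete, here is a sketch. The
pin_dirty_pages() call is purely hypothetical - it stands in for the
proposed "mark pages unflushable" interface - but the control flow
shows why this code can never make progress:

#include <fcntl.h>
#include <unistd.h>

int pin_dirty_pages(int fd);    /* hypothetical, for illustration only */

void demo(int fd, char *buf, size_t len)
{
        write(fd, buf, len);    /* dirties page cache pages */
        pin_dirty_pages(fd);    /* pages are now unflushable */
        fsync(fd);              /* blocks waiting on the pinned pages;
                                 * the only thread that could unpin
                                 * them is blocked right here */
}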

Hmmmm. What happens if the process crashes after pinning the dirty
pages? How do we even know what process pinned the dirty pages so
we can clean up after it? What happens if the same page is pinned by
multiple processes? What happens on truncate/hole punch if the
partial pages in the range that need to be zeroed and written are
pinned? What happens if we do direct IO to a range with pinned,
unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing
applications to pin dirty pages in memory. I've only spent a few
minutes coming up with these, and I'm sure there's more of them.
As such, I just don't see allowing userspace to pin dirty
page cache pages in memory as a workable solution.

> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then

/sys/block/<dev>/queue/* contains all the information that is
exposed to filesystems to optimise layout for storage geometry.
Some filesystems already expose the relevant parts of this
information to userspace; others don't.
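
For example, the request queue parameters are already trivially
readable from userspace. A minimal sketch - the device name "sda" is
an assumption for illustration, and optimal_io_size reads back as 0
when the device reports no stripe geometry:

#include <stdio.h>

static long queue_param(const char *dev, const char *param)
{
        char path[256];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s",
                 dev, param);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        printf("min io: %ld\n", queue_param("sda", "minimum_io_size"));
        printf("opt io: %ld\n", queue_param("sda", "optimal_io_size"));
        printf("sector: %ld\n", queue_param("sda", "logical_block_size"));
        return 0;
}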

What I think we really need to provide is a generic interface
similar to the old XFS_IOC_DIOINFO ioctl that can be used to
expose IO characteristics to applications in a simple, easy to
gather manner. Something like:

struct io_info {
        u64 minimum_io_size;        /* sector size */
        u64 maximum_io_size;        /* currently 2GB */
        u64 optimal_io_size;        /* stripe unit/width */
        u64 optimal_io_alignment;   /* stripe unit/width */
        u64 mem_alignment;          /* PAGE_SIZE */
        u32 queue_depth;            /* max IO concurrency */
};
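
For reference, the existing XFS ioctl mentioned above is used like
this today - XFS_IOC_DIOINFO fills a struct dioattr with the memory
alignment and the min/max direct IO sizes for an open file (this
sketch assumes the xfsprogs headers are installed):

#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>

int print_dioinfo(int fd)
{
        struct dioattr da;

        if (ioctl(fd, XFS_IOC_DIOINFO, &da) < 0)
                return -1;
        printf("mem align: %u, min io: %u, max io: %u\n",
               da.d_mem, da.d_miniosz, da.d_maxiosz);
        return 0;
}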

> Postgres could use directio and decide to do prefetching based on the
> raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the
filesystem may be different. Hence you want what the filesystem
considers optimal, not what the underlying storage is configured
with. Indeed, a filesystem might be able to supply per-file IO
characteristics depending on where it is located in the filesystem
(think tiered storage)....

> how much available i/o bandwidth and iops is available,
> etc.

The kernel doesn't really know what a device is capable of - it can
only measure what the current IO workload is achieving - and that
changes with the characteristics of the workload. Hence applications
can track this as well as the kernel does if they need this
information for any reason.

> Reimplementing i/o schedulers and all the rest of the work that the

Nobody needs to reimplement IO schedulers in userspace. Direct IO
still goes through the block layer, where all the merging and
IO scheduling occurs.

> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

That argument goes both ways - providing fine-grained control over
the page cache contents to userspace doesn't get me excited, either.
In fact, it scares the living daylights out of me. It's complex,
it's fragile and it introduces constraints into everything we do in
the kernel. Any one of those reasons is grounds for saying no to a
proposal, but this idea hits the trifecta....

I'm not saying that O_DIRECT is easy or perfect, but it seems to me
to be a more robust, secure, maintainable and simpler solution than
trying to give applications direct control over complex internal
kernel structures and algorithms.
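
To be concrete about what that discipline looks like from userspace,
here is a minimal O_DIRECT write sketch. The 512 byte alignment is an
assumption for illustration - a real application should query the
filesystem (e.g. via an io_info-style interface) rather than
hard-coding it - and note the file ends up padded to the aligned
length:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int direct_write(const char *path, const void *data, size_t len)
{
        size_t align = 512;     /* assumed; ask the filesystem */
        size_t buflen = (len + align - 1) & ~(align - 1);
        void *buf;
        int fd, ret = -1;

        fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
                return -1;
        if (posix_memalign(&buf, align, buflen) == 0) {
                memset(buf, 0, buflen);         /* zero the padding */
                memcpy(buf, data, len);
                /* buffer, offset and length are all multiples of
                 * align, so the kernel can submit this straight to
                 * the block layer */
                if (pwrite(fd, buf, buflen, 0) == (ssize_t)buflen)
                        ret = 0;
                free(buf);
        }
        close(fd);
        return ret;
}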

Cheers,

Dave.
--
Dave Chinner
david(at)fromorbit(dot)com
