Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Dave Chinner <david(at)fromorbit(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Trond Myklebust <trondmy(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 08:08:40
Message-ID: 52D4F088.20600@2ndQuadrant.com
Lists: pgsql-hackers

On 01/14/2014 03:44 AM, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
>> On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
>>> a file into a user provided buffer, thus obtaining a page cache entry
>>> and a copy in their userspace buffer, then insert the page of the user
>>> buffer back into the page cache as the page cache page ... that's right,
>>> isn't it postgress people?
>> Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
>> isn't needed anymore when reading. And we'd normally write if the page
>> is dirty.
> So why, exactly, do you even need the kernel page cache here? You've
> got direct access to the copy of data read into userspace, and you
> want direct control of when and how the data in that buffer is
> written and reclaimed. Why push that data buffer back into the
> kernel and then have to add all sorts of kernel interfaces to
> control the page you already have control of?
To let the kernel do the job that it is good at, namely managing the
write-back of dirty buffers to disk and managing (possible) read-ahead pages.

While we do have control of "the page", we do not (and really don't want to)
have control of the complex and varied side of efficiently reading and
writing to various file-systems with possibly very different disk
configurations.

We much prefer that the kernel take care of it, and we generally like how
the kernel manages it.

We have a few suggestions about giving the kernel extra info about the
application's usage patterns of the data.
>
>>> Effectively you end up with buffered read/write that's also mapped into
>>> the page cache. It's a pretty awful way to hack around mmap.
>> Well, the problem is that you can't really use mmap() for the things we
>> do. Postgres' durability works by guaranteeing that our journal entries
>> (called WAL := Write Ahead Log) are written & synced to disk before the
>> corresponding entries of tables and indexes reach the disk. That also
>> allows to group together many random-writes into a few contiguous writes
>> fdatasync()ed at once. Only during a checkpointing phase the big bulk of
>> the data is then (slowly, in the background) synced to disk.
> Which is the exact algorithm most journalling filesystems use for
> ensuring durability of their metadata updates. Indeed, here's an
> interesting piece of architecture that you might like to consider:
>
> * Neither XFS nor BTRFS use the kernel page cache to back their
> metadata transaction engines.
But file system code is supposed to know much more about the
underlying disk than a mere application program like postgresql.

We do not want to start duplicating the OS if we can avoid it.

What we would like is to have a way to tell the kernel

1) "here is the modified copy of file page, it is now safe to write
it back" - the current 'lazy' write

2) "here is the page, write it back now, before returning success
to me" - unbuffered write or write + sync

but we also would like to have

3) "here is the page as it is currently on disk, I may need it soon,
so keep it together with your other clean pages accessed at time X"
- this is the non-dirtying write discussed

The page may be in the buffer cache, in which case just update its LRU
position (to either the current time or a time provided by postgresql), or
it may not be there, in which case put it there if reasonable by its
LRU position.
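
For illustration only, here is a minimal C sketch of how the three cases
relate to interfaces that exist today. The function names and the PAGE_SZ
constant are made up for this example and are not PostgreSQL code. Cases 1
and 2 map onto pwrite() and pwrite() + fdatasync(). Case 3 has no
equivalent: the closest existing hint, posix_fadvise(POSIX_FADV_WILLNEED),
asks the kernel to read the range from disk itself instead of accepting the
clean copy we already hold, and that is exactly the gap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192 /* PostgreSQL's default block size; illustrative here */

/* Case 1: lazy write. The kernel copies the page into its cache and
 * decides on its own when to write it back. Returns 0 on success.
 * A short write is treated as a failure to keep the sketch small. */
int page_write_lazy(int fd, const void *page, off_t offset)
{
    assert(page != NULL);
    ssize_t n = pwrite(fd, page, PAGE_SZ, offset);
    return n == PAGE_SZ ? 0 : -1;
}

/* Case 2: write, then wait until the data has reached stable storage
 * before reporting success. Returns 0 on success. */
int page_write_durable(int fd, const void *page, off_t offset)
{
    if (page_write_lazy(fd, page, offset) != 0)
        return -1;
    return fdatasync(fd);
}

/* Case 3 (approximation only): there is no call that hands a clean page
 * back to the page cache. POSIX_FADV_WILLNEED merely asks the kernel to
 * bring the range in from disk. Returns 0 on success or an error number
 * (posix_fadvise does not use errno). */
int page_hint_keep(int fd, off_t offset)
{
    return posix_fadvise(fd, offset, PAGE_SZ, POSIX_FADV_WILLNEED);
}
```

Note that page_hint_keep() cannot carry an access time or an LRU position,
which is the extra information we would like to be able to pass.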

And we would like all this to work together with other current linux
kernel goodness of managing the whole disk-side interaction of
efficient reading and writing and managing the buffers :)
> Why not? Because the page cache is too simplistic to adequately
> represent the complex object hierarchies that the filesystems have
> and so its flat LRU reclaim algorithms and writeback control
> mechanisms are a terrible fit and cause lots of performance issues
> under memory pressure.
The same is true for postgresql - if we just used direct writes
and reads from disk, the performance would be terrible.

We would need to duplicate all the complicated algorithms the file
system uses for good performance if we were to start implementing
that part of the file system ourselves.

> IOWs, the two most complex high performance transaction engines in
> the Linux kernel have moved to fully customised cache and (direct)
> IO implementations because the requirements for scalability and
> performance are far more complex than the kernel page cache
> infrastructure can provide.
And we would like to avoid implementing this again by delegating
this part of work to said complex high performance transaction
engines in the Linux kernel.

We do not want to abandon all work on postgresql's own business code
and go into file system development mode for the next few years.

Again, as said above, the linux file system is doing fine. What we
want is a few ways to interact with it to let it do even better when
working with postgresql, by telling it some things it would otherwise
have to second-guess and by sometimes giving it back some cache
pages which were copied away for potential modification but ended
up clean in the end.

And let the linux kernel decide if and how long to keep these pages
in its cache, using its superior knowledge of the disk subsystem and
of what else is going on in the system in general.
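
To make the "copied away" situation concrete, here is a small hypothetical
C sketch of the read path Andres describes above: read the page into a
private buffer, then tell the kernel its cached copy is no longer needed.
The function name and PAGE_SZ are assumptions for this example, not
PostgreSQL code. Once the hint has been given, there is no existing call
to hand the page back if it turns out to be still clean.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192 /* PostgreSQL's default block size; illustrative here */

/* Read one page into a caller-owned buffer, then hint that the kernel may
 * drop its cached copy. Returns 0 on success, -1 on a failed or short read. */
int page_read_private(int fd, void *buf, off_t offset)
{
    assert(buf != NULL);
    ssize_t n = pread(fd, buf, PAGE_SZ, offset);
    if (n != PAGE_SZ)
        return -1;
    /* Advisory only: the kernel is free to ignore it, so the result is
     * deliberately not treated as an error. */
    (void) posix_fadvise(fd, offset, PAGE_SZ, POSIX_FADV_DONTNEED);
    return 0;
}
```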

Just food for thought....

We want to have all the performance and complexity provided
by linux, and we would like it to work even better with postgresql by
having a bit more information for its decisions.

We just don't want to re-implement it ;)

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
