Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Dave Chinner <david(at)fromorbit(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)hansenpartnership(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 19:03:28
Message-ID: 52D58A00.3040802@archidevsys.co.nz
Lists: pgsql-hackers

On 14/01/14 14:09, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
>> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
[...]
>> The more ambitious and interesting direction is to let Postgres tell
>> the kernel what it needs to know to manage everything. To do that we
>> would need the ability to control when pages are flushed out. This is
>> absolutely necessary to maintain consistency. Postgres would need to
>> be able to mark pages as unflushable until some point in time in the
>> future when the journal is flushed. We discussed various ways that
>> interface could work but it would be tricky to keep it low enough
>> overhead to be workable.
> IMO, the concept of allowing userspace to pin dirty page cache
> pages in memory is just asking for trouble. Apart from the obvious
> memory reclaim and OOM issues, some filesystems won't be able to
> move their journals forward until the data is flushed. i.e. ordered
> mode data writeback on ext3 will have all sorts of deadlock issues
> that result from pinning pages and then issuing fsync() on another
> file which will block waiting for the pinned pages to be flushed.
>
> Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
> If fsync() blocks because there are pinned pages, and there's no
> other thread to unpin them, then that code just deadlocked. If
> fsync() doesn't block and skips the pinned pages, then we haven't
> done an fsync() at all, and so violated the expectation that users
> have that after fsync() returns their data is safe on disk. And if
> we return an error to fsync(), then what the hell does the user do
> if it is some other application we don't know about that has pinned
> the pages? And if the kernel unpins them after some time, then we
> just violated the application's consistency guarantees....
>
[...]

What if Postgres could tell the kernel how strongly it wanted to hold
on to the pages?

Say a byte (this is arbitrary; it could even be a single hint bit meaning
"please, Please, PLEASE don't flush, if that is okay with you Mr
Kernel..."), so the strength would be S = (unsigned byte value)/256,
giving 0 <= S < 1.

S = 0      flush now.
0 < S < 1  flush if the 'need' is greater than S.
S = 1      never flush (note that S = 1 cannot occur, as max S = 255/256).
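
A minimal sketch of the rule above, written in Python purely for brevity.
The names strength() and should_flush(), and the idea that 'need' is a
number on the same 0..1 scale as S, are assumptions made for illustration;
no such interface exists in the kernel.

```python
def strength(hint_byte: int) -> float:
    """Map an unsigned byte hint (0..255) to S = hint/256, so 0 <= S < 1."""
    if not 0 <= hint_byte <= 255:
        raise ValueError("hint must fit in an unsigned byte")
    return hint_byte / 256


def should_flush(hint_byte: int, need: float) -> bool:
    """Decide whether a page carrying this hint should be flushed.

    S = 0 means flush now; otherwise flush only when 'need' exceeds S.
    'need' is a hypothetical measure of memory pressure on the S scale.
    """
    s = strength(hint_byte)
    if s == 0:
        return True
    return need > s
```

Because the largest hint is 255, S tops out at 255/256, so a large enough
'need' can always force a flush and nothing is ever truly pinned.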

Postgres could use low non-zero S values if it thinks that pages /might/
still be useful later, and very high values when it is /more certain/.
I am sure Postgres must sometimes know when some pages are more
important to hold onto than others, hence my feeling that S should be
more than one bit.

The kernel might simply flush pages starting at ones with low values of
S working upwards until it has freed enough memory to resolve its memory
pressure. So an explicit numerical value of 'need' (as implied above)
is not required. Also any practical implementation would not use 'S' as
a float/double, but use integer values for 'S' & 'need' - assuming that
'need' did have to be an actual value, which I suspect would not be
required.

This way the kernel is free to flush all such pages, when sufficient
need arises - yet usually, when there is sufficient memory, the pages
will be held unflushed.
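
The "lowest S first" ordering described above could be sketched roughly as
follows, again in Python for brevity. The function name pages_to_flush(),
the (page_id, hint_byte, size_in_bytes) tuple layout, and the bytes_needed
target are all hypothetical; this only illustrates the selection order, not
how kernel writeback or reclaim actually works.

```python
def pages_to_flush(pages, bytes_needed):
    """Pick pages to flush, lowest hint first, until enough memory is freed.

    pages: iterable of (page_id, hint_byte, size_in_bytes) tuples.
    bytes_needed: how much memory must be freed to relieve the pressure.
    """
    chosen = []
    freed = 0
    # sorted() is stable, so pages with equal hints keep their input order.
    for page_id, hint, size in sorted(pages, key=lambda p: p[1]):
        if freed >= bytes_needed:
            break
        chosen.append(page_id)
        freed += size
    return chosen
```

Note that no explicit 'need' value appears here: the integer hints are only
compared with each other, and the loop stops as soon as enough has been
freed.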

Cheers,
Gavin
