From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-13 22:44:53 |
Message-ID: | 20140113224453.GE9762@awork2.anarazel.de |
Lists: | pgsql-hackers |
On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> > Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
>
> Is this because of problems in the mmap_sem?
It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the number of IPIs increasing far too much for
it to be practical, but I am not sure anymore.
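For readers who want to see the access pattern being discussed, here is a minimal sketch of mapping, advising, reading and unmapping one 8kb block at a time. It is an illustration only, not postgres code: the function name read_blocks_via_mmap and the constant BLOCK_SIZE are made up for this example, and it assumes Linux, Python 3.8+ and a page size that divides 8192. It shows only the shape of the per-block calls; it does not measure or reproduce the contention described above.

```python
import mmap
import os

BLOCK_SIZE = 8192  # assumed 8kb block size, as in the discussion above


def read_blocks_via_mmap(path, block_numbers):
    """Map, advise, copy out and unmap one block at a time (hypothetical helper)."""
    if BLOCK_SIZE % mmap.ALLOCATIONGRANULARITY != 0:
        # mmap offsets must be aligned to the allocation granularity.
        raise RuntimeError("page size does not divide the block size")
    blocks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for blkno in block_numbers:
            # One mmap() call per block; every call modifies the address space.
            mm = mmap.mmap(fd, BLOCK_SIZE, access=mmap.ACCESS_READ,
                           offset=blkno * BLOCK_SIZE)
            try:
                # madvise() is only available on some platforms and versions.
                if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                    mm.madvise(mmap.MADV_WILLNEED)
                blocks.append(mm[:])  # copy the block out of the mapping
            finally:
                mm.close()  # releases the mapping (munmap)
    finally:
        os.close(fd)
    return blocks
```

Each block costs three address-space operations here, which is the per-8kb overhead the reply above is concerned about when it is repeated at a high rate.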
> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
>
> You mean madvise (page_addr)? It shouldn't ... the state of the dirty
> bit should only be updated by actual writes. Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).
Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?
> We also do have a way of transplanting pages: it's called splice. How
> do the semantics of splice differ from what you need?
Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
kernel.
We don't need zero-copy IO here; the important thing is just to fill the
pagecache with content without a) rereading the page from disk or b)
marking the page as dirty.
> > One major usecase is transplanting a page coming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
>
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?
I think purely using mmap() is pretty unlikely to work out - there are
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in either an
absurd number of mappings or mappings of far too large areas; e.g. large
btree indexes are usually accessed in a quite fragmented manner.
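To make the "number of mappings" point concrete, here is a small sketch that counts how many separate contiguous mappings a given set of accessed blocks would need if only the touched blocks were mapped. The helper name coalesce_into_mappings and the block numbers are hypothetical, chosen purely for illustration; nothing here is taken from postgres.

```python
def coalesce_into_mappings(block_numbers):
    """Group block numbers into contiguous (first_block, n_blocks) runs."""
    runs = []
    for blkno in sorted(set(block_numbers)):
        if runs and blkno == runs[-1][0] + runs[-1][1]:
            # Block directly follows the previous run: extend that run.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap before this block: a new mapping would be needed.
            runs.append((blkno, 1))
    return runs


# Sequential access collapses into a single mapping ...
print(coalesce_into_mappings([0, 1, 2, 3]))
# ... while scattered access needs nearly one mapping per block.
print(coalesce_into_mappings([0, 7, 2, 9, 3]))
```

The alternative to many small runs is one mapping that spans the whole range, which is the "far too large areas" side of the trade-off.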
> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
>
> Well, then, it sounds like the best way forward would be to get
> postgres to use the kernel page cache more efficiently.
No arguments there, although working on postgres scalability is a good
idea as well ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services