From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Claudio Freire <klaussfreire(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, James Bottomley <James(dot)Bottomley(at)hansenpartnership(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Trond Myklebust <trondmy(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-15 18:41:14 |
Message-ID: | 20140115184113.GK2686@tamriel.snowman.net |
Lists: | pgsql-hackers |
* Claudio Freire (klaussfreire(at)gmail(dot)com) wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.
Agreed, except that KSM seems like it'd be slow/lazy about it, and I'm
guessing there's a reason the pagecache isn't included normally.
> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.
Any idea why that limitation is there?
> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.
Not entirely sure I'm following this: if it's a shared page, it doesn't
matter who starts writing to it; as soon as that happens, it needs to get
copied. Perhaps you mean that the application should keep the
"original" and that the pagecache should get the "copy" (or, really,
perhaps just forget about the page existing at that point; we won't want
it again...).
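To make the copy-on-write semantics concrete, here is a minimal userspace model (not kernel code, and not how KSM or the pagecache are actually implemented): a refcounted physical page shared by two logical pages, one standing in for the pagecache entry and one for a shared-buffers slot. The names PhysPage, LogicalPage, share() and write() are hypothetical. Whichever side writes first takes a private copy while the page is still shared; in this model the writer gets the copy and the other side keeps the original.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192  # PostgreSQL's default block size (BLCKSZ)


@dataclass
class PhysPage:
    """One physical page frame, possibly shared by several logical pages."""
    data: bytearray = field(default_factory=lambda: bytearray(PAGE_SIZE))
    refcount: int = 1


@dataclass
class LogicalPage:
    """A logical view of a page, e.g. a pagecache entry or a shared-buffers slot."""
    phys: PhysPage

    def share(self) -> "LogicalPage":
        # Zero-copy: the new logical page points at the same physical frame.
        self.phys.refcount += 1
        return LogicalPage(self.phys)

    def write(self, offset: int, payload: bytes) -> None:
        # Copy-on-write: if anyone else still references the frame,
        # take a private copy before modifying it.
        if self.phys.refcount > 1:
            self.phys.refcount -= 1
            self.phys = PhysPage(bytearray(self.phys.data))
        self.phys.data[offset:offset + len(payload)] = payload


if __name__ == "__main__":
    cache = LogicalPage(PhysPage())
    cache.phys.data[0:5] = b"clean"
    shared_buf = cache.share()     # same frame, refcount is now 2
    shared_buf.write(0, b"dirty")  # first write triggers the copy
    print(cache.phys is shared_buf.phys)  # False
    print(bytes(cache.phys.data[:5]), bytes(shared_buf.phys.data[:5]))  # b'clean' b'dirty'
```

The "pagecache just forgets about the page" variant would correspond to dropping the cache's reference instead of copying, so the writer ends up as the sole owner of the original frame.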
Would that be a way to go, perhaps? This does go back to the "make it
act like mmap, but not *be* mmap" idea, but it would look like this:
    open(..., O_ZEROCOPY_READ)
    read()                        - goes to PG's shared buffers; the pagecache and PG share the page
    page fault (PG writes to it)  - the pagecache forgets about the page
    write() / fsync()             - operate as normal
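O_ZEROCOPY_READ is only the flag proposed here and does not exist, but the sequence can be roughly approximated today with a private (copy-on-write) mapping, which is the "act like mmap" half of the idea. This sketch assumes a Unix system, substitutes Python's mmap with ACCESS_COPY (MAP_PRIVATE) for the zero-copy read, and uses hypothetical helper names (zero_copy_read(), modify(), write_back()). For simplicity it maps one OS page rather than an 8 KB PG block.

```python
import mmap
import os

OS_PAGE_SIZE = mmap.PAGESIZE  # usually 4 KB; a PG block would span two of these


def zero_copy_read(fd: int, pageno: int) -> mmap.mmap:
    # Stand-in for read() under the hypothetical O_ZEROCOPY_READ: a private
    # (copy-on-write) mapping shares the pagecache page until we write to it.
    return mmap.mmap(fd, OS_PAGE_SIZE, access=mmap.ACCESS_COPY,
                     offset=pageno * OS_PAGE_SIZE)


def modify(page: mmap.mmap, offset: int, payload: bytes) -> None:
    # The first store triggers a copy-on-write page fault: this process gets
    # a private copy and the pagecache page stays clean.
    page[offset:offset + len(payload)] = payload


def write_back(fd: int, pageno: int, page: mmap.mmap) -> None:
    # Ordinary buffered write(): the kernel decides when to write it out,
    # so writes can still be combined until fsync() forces them to disk.
    os.pwrite(fd, page[:], pageno * OS_PAGE_SIZE)
    os.fsync(fd)
```

One difference from the proposal: with MAP_PRIVATE the pagecache keeps its clean copy after the write fault instead of forgetting the page, and pwrite() copies the modified data back into the pagecache, so dirty pages are still double-buffered. Dropping the pagecache's copy at fault time is the part that would need the new kernel behaviour.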
The differences here from O_DIRECT are that the pagecache will keep the
page while it's clean (absolutely valuable from PG's perspective, since we
might have to evict the page from shared buffers sooner than the kernel
does), and the write()s happen at the kernel's pace, allowing for
write-combining, etc., until an fsync() happens, of course.
This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.
Thanks,
Stephen