From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Mel Gorman <mgorman(at)suse(dot)de> |
Cc: | Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org> |
Subject: | Re: Linux kernel impact on PostgreSQL performance |
Date: | 2014-01-14 17:30:19 |
Message-ID: | CAMkU=1zDtxQyF+f1HU+ArMdBQRi=xv8p=1o11wjmyJX6uoaWnw@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jan 13, 2014 at 2:36 PM, Mel Gorman <mgorman(at)suse(dot)de> wrote:
> On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> > > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> > >>
> > >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> > >> wrote:
> > >>>
> > >>> On a related note, there's also the problem of double-buffering.
> > >>> When we read a page into shared_buffers, we leave a copy behind in the OS
> > >>> buffers, and similarly on write-out. It's very unclear what to do
> > >>> about this, since the kernel and PostgreSQL don't have intimate
> > >>> knowledge of what each other are doing, but it would be nice to solve
> > >>> somehow.
> > >>
> > >>
> > >>
> > >> There you have a much harder algorithmic problem.
> > >>
> > >> You can basically control duplication with fadvise and DONTNEED. The
> > >> problem here is not the kernel and whether or not it allows postgres
> > >> to be smart about it. The problem is... what kind of smarts
> > >> (algorithm) to use.
> > >
> > >
> > > Isn't this a fairly simple matter of when we read a page into shared buffers
> > > tell the kernel to forget that page? And a corollary to that for when we
> > > dump a page out of shared_buffers (here kernel, please put this back into
> > > your cache).
> >
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> >
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
>
> posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> forcing readahead.
But telling the kernel to forget a page, and then telling it to read that
page back in from disk because it might be needed again in the near future,
is itself very expensive. We would need a way to hand the page to the kernel,
so that it has the page without needing to go to disk to get it.
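
To make that round trip concrete, here is a minimal sketch in Python of the forget-then-prefetch sequence using calls that exist today. The helper names (`forget_then_prefetch`, `demo`) and the 8192-byte block size are assumptions for illustration only, not anything PostgreSQL does. Both hints are advisory, so the kernel is free to ignore them.

```python
import os
import tempfile

PAGE = 8192  # assumed block size for this sketch (PostgreSQL's default)


def forget_then_prefetch(path, offset=0, length=PAGE):
    """Ask the kernel to drop a file range, then to read it back in."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint: we no longer need this range; clean cached pages may be dropped.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
        # Hint: we will need this range soon. If the pages were dropped,
        # the kernel has to fetch them from storage again; there is no
        # call here that hands it the bytes we already hold in memory.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def demo():
    """Write one block to a scratch file, flush it, and run the sequence."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * PAGE)
        f.flush()
        os.fsync(f.fileno())  # dirty pages are not dropped by DONTNEED
        path = f.name
    try:
        return forget_then_prefetch(path)
    finally:
        os.unlink(path)
```

The data comes back intact either way; what differs is whether the second hint costs a physical read, which is the expense described above.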
> If you evict it prematurely then you do get kinda
> screwed because you pay the IO cost to read it back in again even if you
> had enough memory to cache it. Maybe this is the type of kernel-postgres
> interaction that is annoying you.
>
> If you don't evict, the kernel eventually steps in and evicts the wrong
> thing. If you do evict and it was unnecessary, you pay an IO cost.
>
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.
>
I think the "reclaim this page if you need memory but leave it resident if
there is no memory pressure" hint would be more useful for temporary
working files than for what was being discussed above (shared buffers).
When I do work that needs large temporary files, I often see physical
write IO spike while physical read IO does not. I interpret that to mean
that the temporary data is being written to disk to satisfy either
dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
cache and so no disk reads are needed to read it back. So a hint that says
"this file will never be fsynced so please ignore dirty_*bytes and
dirty_expire_centisecs. I will need it again relatively soon (but not
after a reboot), but will do so mostly sequentially, so please don't evict
this without need, but if you do need to then it is a good candidate" would
be good.
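
For reference, the writeback settings mentioned above can be inspected on Linux under /proc/sys/vm. The following is a small Python sketch, assuming that `dirty_*bytes` refers to `dirty_bytes` and `dirty_background_bytes`; the helper name `read_dirty_settings` is hypothetical. Settings that cannot be read (for example on a non-Linux system) come back as `None`.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # Linux-specific location of the VM sysctls
KNOBS = ("dirty_expire_centisecs", "dirty_bytes", "dirty_background_bytes")


def read_dirty_settings(vm_dir=VM_DIR):
    """Return the current value of each writeback knob, or None if unreadable."""
    settings = {}
    for name in KNOBS:
        try:
            settings[name] = int((vm_dir / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_dirty_settings().items():
        print(f"{name} = {value}")
```

These are system-wide settings; the sketch only reads them and shows nothing like the per-file hint proposed above, which as far as this thread establishes does not exist.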
Cheers,
Jeff