Re: shared_buffers documentation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared_buffers documentation
Date: 2010-04-17 02:08:04
Message-ID: r2k603c8f071004161908g2bae5d83l3754862cb39a182@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 16, 2010 at 9:47 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>> Well, why can't they just hang out as dirty buffers in the OS cache,
>> which is also designed to solve this problem?
>
> If the OS were guaranteed to be as suitable for this purpose as the approach
> taken in the database, this might work.  But much like the clock sweep
> approach should outperform a simpler OS caching implementation in many
> common workloads, there are a couple of spots where making dirty writes the
> OS's problem can fall down:
>
> 1) That presumes that OS write coalescing will solve the problem for you by
> merging repeat writes, which depending on implementation it might not.
>
> 2) On some filesystems, such as ext3, any write with an fsync behind it will
> flush the whole write cache out and defeat this optimization.  Since the
> spread checkpoint design has some such writes going to the data disk in the
> middle of the currently processing checkpoint, in those situations that's
> likely to push the first write of that block to disk before it can be
> combined with a second.  If you had kept it in the buffer cache it might
> survive as long as a full checkpoint cycle longer.
>
> 3) The "timeout" as it were for shared buffers is driven by the distance
> between checkpoints, typically as long as 5 minutes.  The longest a
> filesystem will hold onto a write is probably less.  On Linux it's typically
> 30 seconds before the OS considers a write important to get out to disk,
> longest case; if you've already filled a lot of RAM with writes it can be
> substantially less.
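
[Illustration: the effect described in point 3 can be sketched with a toy model. It assumes, as a simplification, that a cache holds a dirty block for a fixed time after it is first dirtied and that every write arriving inside that window is merged into a single physical write. The 300-second and 30-second hold times come from the figures quoted above; the function name and the write pattern are hypothetical, and real kernels and checkpointers behave in more complicated ways.]

```python
def count_physical_writes(dirty_times, hold_seconds):
    """Count physical writes for one block under a fixed hold time.

    Toy model (an assumption, not PostgreSQL or kernel code): the first
    write to a clean block opens a window of `hold_seconds`. Writes that
    land inside the window are merged; the block is flushed once when
    the window closes, and the next write opens a new window.
    """
    writes = 0
    flush_deadline = None
    for t in sorted(dirty_times):
        if flush_deadline is None or t >= flush_deadline:
            # Block was clean: this write starts a new hold window,
            # which will end in exactly one physical write.
            writes += 1
            flush_deadline = t + hold_seconds
    return writes


# Hypothetical workload: the same block is dirtied every 10 seconds
# for 10 minutes.
dirty_times = range(0, 600, 10)

os_cache_writes = count_physical_writes(dirty_times, hold_seconds=30)
shared_buffers_writes = count_physical_writes(dirty_times, hold_seconds=300)

print("30 s hold (OS cache):", os_cache_writes)
print("300 s hold (checkpoint interval):", shared_buffers_writes)
```

[Under these assumptions the longer hold time merges more repeat writes to the same block, which is the advantage claimed for keeping the block in shared buffers until the next checkpoint.]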

Thanks for the explanation. That makes sense. Does this imply that
the problems with shared_buffers being too small are going to be less
with a read-mostly load?

>> I guess the obvious question is whether Windows "doesn't need" more
>> shared memory than that, or whether it "can't effectively use" more
>> memory than that.
>
> It's probably "can't effectively use".  We know for a fact that applications
> where blocks regularly accumulate high usage counts and have repeat
> read/writes to them, which includes pgbench, benefit in several easy to
> measure ways from using larger amounts of database buffer cache.  There's
> just plain old less churn of buffers going in and out of there.  The
> alternate explanation of "Windows is just so much better at read/write
> caching that you should give it most of the RAM anyway" doesn't really sound
> as probable as the more commonly proposed theory "Windows doesn't handle
> large blocks of shared memory well".
>
> Note that there's no discussion of the why behind this in the commit you
> just did, just the description of what happens.  The reasons why are left
> undefined, which I feel is appropriate given we really don't know for sure.
> Still waiting for somebody to let loose the Visual Studio profiler and
> measure what's causing the degradation at larger sizes.

Right - my purpose in wanting to revise the documentation was not to
give a complete tutorial, which is obviously not practical, but to
give people some guidelines that are better than our previous
suggestion to use "a few tens of megabytes", which I think we've
accomplished. The follow-up questions are mostly for my own benefit
rather than the docs...

...Robert
