From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared_buffers documentation
Date: 2010-04-17 02:08:04
Message-ID: r2k603c8f071004161908g2bae5d83l3754862cb39a182@mail.gmail.com
Lists: pgsql-hackers
On Fri, Apr 16, 2010 at 9:47 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>> Well, why can't they just hang out as dirty buffers in the OS cache,
>> which is also designed to solve this problem?
>
> If the OS were guaranteed to be as suitable for this purpose as the approach
> taken in the database, this might work. But much like the clock sweep
> approach should outperform a simpler OS caching implementation in many
> common workloads, there are a couple of spots where making dirty writes the
> OS's problem can fall down:
>
> 1) That presumes that OS write coalescing will solve the problem for you by
> merging repeated writes, which, depending on the implementation, it might not.
>
> 2) On some filesystems, such as ext3, any write with an fsync behind it will
> flush the whole write cache out and defeat this optimization. Since the
> spread checkpoint design has some such writes going to the data disk in the
> middle of the checkpoint currently being processed, in those situations the
> first write of that block is likely to be pushed to disk before it can be
> combined with a second. Had it been kept in the buffer cache instead, it
> might have survived as much as a full checkpoint cycle longer.
>
> 3) The "timeout" as it were for shared buffers is driven by the distance
> between checkpoints, typically as long as 5 minutes. The longest a
> filesystem will hold onto a write is probably less. On Linux it's typically
> 30 seconds, at the longest, before the OS considers a write important to get
> out to disk; if you've already filled a lot of RAM with writes it can be
> substantially less.
Thanks for the explanation. That makes sense. Does this imply that
the problems with shared_buffers being too small will be less severe
under a read-mostly load?
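(As an aside, and purely my own illustration rather than anything in Greg's
message: the Linux writeback behavior he describes in point 3 is governed by
the vm.dirty_* tunables, which can be inspected with a quick sketch like the
one below, assuming a stock /proc layout.)

# Rough sketch, not from this thread: dirty_expire_centisecs commonly
# defaults to 3000 (~30 seconds, the "longest case" above), while dirty_ratio
# and dirty_background_ratio are the "filled a lot of RAM with writes"
# thresholds that force flushes sooner.
for knob in ("dirty_expire_centisecs",
             "dirty_writeback_centisecs",
             "dirty_background_ratio",
             "dirty_ratio"):
    with open("/proc/sys/vm/" + knob) as f:
        print("%s = %s" % (knob, f.read().strip()))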
>> I guess the obvious question is whether Windows "doesn't need" more
>> shared memory than that, or whether it "can't effectively use" more
>> memory than that.
>
> It's probably "can't effectively use". We know for a fact that applications
> where blocks regularly accumulate high usage counts and have repeat
> read/writes to them, which includes pgbench, benefit in several easy-to-measure
> ways from using larger amounts of database buffer cache. There's
> just plain old less churn of buffers going in and out of there. The
> alternate explanation of "Windows is just so much better at read/write
> caching that you should give it most of the RAM anyway" doesn't really sound
> as probable as the more commonly proposed theory "Windows doesn't handle
> large blocks of shared memory well".
>
> Note that there's no discussion of the why behind this in the commit you
> just did, just the description of what happens. The reasons why are left
> undefined, which I feel is appropriate given we really don't know for sure.
> Still waiting for somebody to let loose the Visual Studio profiler and
> measure what's causing the degradation at larger sizes.
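(Tangentially, and again just my own sketch rather than anything from this
thread: the usage-count accumulation Greg describes can be observed directly
with the pg_buffercache contrib module, assuming it is installed in the
target database and the connection string below is adjusted to your setup.)

# Minimal sketch: summarize the usage-count distribution of shared_buffers
# via pg_buffercache.  Connection settings below are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=postgres")
cur = conn.cursor()
cur.execute("""
    SELECT usagecount, count(*) AS buffers
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount
""")
for usagecount, buffers in cur.fetchall():
    print("usagecount %s: %s buffers" % (usagecount, buffers))
cur.close()
conn.close()

A workload whose buffers mostly sit at the maximum usage count is the sort
that benefits from a larger shared_buffers in the way described above.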
Right - my purpose in wanting to revise the documentation was not to
give a complete tutorial, which is obviously not practical, but to
give people some guidelines that are better than our previous
suggestion to use "a few tens of megabytes", which I think we've
accomplished. The follow-up questions are mostly for my own benefit
rather than the docs...
...Robert