>>> On Fri, Aug 24, 2007 at 5:47 PM, in message
<Pine.GSO.4.64.0708241807500.28499@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Fri, 24 Aug 2007, Kevin Grittner wrote:
>
>> I would be fine with that if I could configure the back end to always write
>> a dirty page to the OS when it is written to shared memory. That would allow
>> Linux and XFS to do their job in a timely manner, and avoid this problem.
>
> You should take a look at the "io storm on checkpoints" thread on the
> pgsql-performance@postgresql.org list started by Dmitry Potapov on 8/22 if
> you aren't on that list. He was running into the same problem as you (and
> me and lots of other people) and had an interesting resolution based on
> tuning the Linux kernel so that it basically stopped caching writes.
I saw it. I think that I'd rather have a write-through cache in PostgreSQL
than give up OS caching entirely. The problem seems to be caused by the
cascade from one cache to the next, so I can easily believe that disabling
the delay on either one solves the problem.
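For anyone who wants the kernel-side alternative from that thread, my
understanding is that it amounts to shrinking the Linux write cache until
dirty pages are flushed almost immediately. A sketch for a 2.6 kernel (the
values are illustrative, not the ones from Dmitry's report):

    # /etc/sysctl.conf -- flush dirty pages early instead of letting them pile up
    vm.dirty_background_ratio = 1    # start background writeback at 1% of RAM dirty
    vm.dirty_ratio = 2               # throttle writers once 2% of RAM is dirty
    vm.dirty_expire_centisecs = 1000 # consider dirty data old after 10 seconds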
> What you suggest here would be particularly inefficient because of how
> much extra I/O would happen on the index blocks involved in the active
> tables.
I've certainly seen that assertion on these lists often. I don't think I've
yet seen any evidence that it's true. When I made the background writer
more aggressive, there was no discernible increase in disk writes at the OS
level (much less from controller cache to the drives). This may not be true
with some of the benchmark software, but in our environment there tends to
be a lot of activity on a single court case, and then they're done with it.
(I spent some time looking at this to tune our heuristics for generating
messages on our interfaces to business partners.)
>> I know we're doing more in 8.3 to move this from the OS's realm into
>> PostgreSQL code, but until I have a chance to test that, I want to make sure
>> that what has been proven to work for us is not broken.
>
> The background writer code that's in 8.2 can be configured as a big
> sledgehammer that happens to help in this area while doing large amounts
> of collateral damage via writing things prematurely.
Again -- those writes go to the OS cache, where each page sits and
accumulates other changes until it settles.
> I would be extremely surprised to find that the code that's already in 8.3
> isn't a big improvement over what you're doing now based on how much it
> has helped others running into this issue.
I'm certainly hoping that it will be. I'm not moving to it for production
until I've established that as a fact, however.
> And much of the code that
> you're relying on now to help with the problem (the all-scan portion of
> the BGW) has already been removed as part of that.
>
> Switching to my Agent Smith voice: "No Kevin, your old background writer
> is already dead". You'd have to produce some really unexpected and
> compelling results during the beta period for it to get put back again.
If I fail to get resources approved to test during beta, this could become
an issue later, when we do get around to testing it. (There's exactly zero
chance of us moving to something which so radically changes a problem area
for us without serious testing.)
For what it's worth, the background writer settings I'm using weren't
arrived at entirely randomly. I monitored I/O during episodes of the
database freezing up and looked at how many writes per second were going
through. I then reasoned that there was no good reason NOT to push data out
from PostgreSQL to the OS at that speed. I split the writes between the LRU
and all-scan portions of the background writer, with heavier weight given
to getting all dirty pages pushed out to the OS cache so that they could
start to age through the OS timers. (While the raw numbers added up to the
peak write load, I figured I was actually allowing some slack: there was
the percentage limit, the two scans would often cover the same ground, and
the interval is a sleep time from the end of one run to the start of the
next.)

Since it was a production system, I made incremental changes each day, and
each day the problem became less severe. At the point where I finally set
it to my calculated numbers, we stopped seeing the problem.
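To make that arithmetic concrete: the knobs involved in 8.2 are the
bgwriter_* settings, and the write ceiling they impose works out to roughly
maxpages * (1000 / bgwriter_delay) * 8KB per second. A sketch of a
configuration in that spirit (illustrative numbers only, not our actual
production values):

    # postgresql.conf (8.2) -- illustrative values only
    bgwriter_delay = 200ms        # sleep between bgwriter rounds
    bgwriter_lru_percent = 20.0   # share of buffers scanned ahead of the clock hand
    bgwriter_lru_maxpages = 200   # cap on LRU-scan writes per round
    bgwriter_all_percent = 10.0   # share of the whole buffer pool scanned per round
    bgwriter_all_maxpages = 600   # cap on all-scan writes per round

At those numbers the all-scan alone can push up to 600 pages * 5 rounds/sec
* 8KB = about 23MB/s out to the OS cache, which is the kind of figure you'd
match against the observed peak write rate.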
I'm not entirely convinced that it's a sound assumption that we should
always try to keep some dirty buffers in the cache on the off chance that
we might be smarter than the OS/FS/RAID controller algorithms about when to
write them. That said, the 8.3 changes sound as though they are likely to
reduce the problems with I/O-related freezes.
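For reference, my understanding is that the spread-checkpoint behavior in
the 8.3 beta is governed by a single setting, if I'm reading the notes
correctly:

    # postgresql.conf (8.3 beta) -- spread each checkpoint's writes over
    # this fraction of the checkpoint interval instead of issuing them at once
    checkpoint_completion_target = 0.9   # illustrative; the default is lower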
Is it my imagination, or are we coming pretty close to the point where we
could accommodate the oft-requested feature of dealing directly with a raw
volume, rather than going through the file system at all?
-Kevin