From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Bob Dusek <redusek(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance config help
Date: 2010-01-11 17:34:09
Message-ID: dcc563d11001110934n7c97be5cv5801289da116b87e@mail.gmail.com
Lists: pgsql-performance
On Mon, Jan 11, 2010 at 9:42 AM, Bob Dusek <redusek(at)gmail(dot)com> wrote:
>> What method of striped RAID? RAID-5? RAID-10? RAID-4? RAID-0?
>
> RAID-0
Just wondering how many drives?
> To be more specific about the degradation, we've set the
> "log_min_duration_statement=200", and when we run with 40 concurrent
> requests, we don't see queries showing up in there. When we run with 60
> concurrent requests, we start seeing queries show up, and when we run 200+
> requests, we see multi-second queries.
>
> This is to be expected, to some extent, as we would expect some performance
> degradation with higher utilization. But, the hardware doesn't appear to be
> very busy, and that's where we're hoping for some help.
It's likely in IO wait.
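(As an aside, for anyone following along: that slow-query logging is a single
postgresql.conf setting. A minimal sketch, using the 200 ms threshold from your
message and assuming $PGDATA points at the data directory:)

    # Append the setting and reload -- no restart is needed for this parameter.
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    log_min_duration_statement = 200   # log any statement that takes longer than 200 ms
    EOF
    pg_ctl reload -D "$PGDATA"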
>> What do the following commands tell you?
>>
>> iostat -x 10 (first iteration doesn't count)
>
> Here's some iostat output (from the 3rd data point from iostat -x 10)...
> this was taken while we were processing 256 simultaneous requests.
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 34.29 0.00 7.09 0.03 0.00 58.58
>
> Device:   rrqm/s  wrqm/s   r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
> sda         0.00  112.20  0.00  133.40     0.00  1964.80    14.73     0.42   3.17   0.04   0.48
> sda1        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> sda2        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> sda3        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> sda4        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> sda5        0.00  112.20  0.00  133.40     0.00  1964.80    14.73     0.42   3.17   0.04   0.48
> sdb         0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> dm-0        0.00    0.00  0.00    0.40     0.00     3.20     8.00     0.00   0.00   0.00   0.00
> dm-1        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
> dm-2        0.00    0.00  0.00   99.90     0.00   799.20     8.00     0.15   1.50   0.01   0.12
> dm-3        0.00    0.00  0.00    0.60     0.00     4.80     8.00     0.00   0.33   0.17   0.01
> dm-4        0.00    0.00  0.00  144.70     0.00  1157.60     8.00     0.46   3.17   0.02   0.35
> dm-5        0.00    0.00  0.00    0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
>
> The iowait seems pretty low, doesn't it?
Depends: is that the first iteration of output? If so, ignore it and
show me the second and later ones. Same for vmstat... in fact, let
them run for a minute or two and attach the results. OTOH, if that
is the second or later set of output, then you're definitely not IO
bound, and I don't see why the CPUs aren't being better utilized.

How many concurrent queries are you running when you take these
measurements? Can you take them with lower and higher numbers of
concurrent users and compare the two? Normally I'd be looking for
context switching eating more and more time in a heavily loaded
system, but I'm not seeing it in your vmstat numbers either.
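To be concrete about what I'd like attached, something along these lines
(just a sketch -- the sample counts and file names are arbitrary) collects
about two minutes of both while the box is under the heavy load:

    # Both tools print a "since boot" first sample -- ignore that line.
    iostat -x 10 13 > iostat.out &   # 13 samples, 10 seconds apart
    vmstat 10 13    > vmstat.out &   # watch the cs (context switch) and wa columns
    wait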
What are your settings for:
effective_cache_size
random_page_cost
work_mem
With your machine and the extra memory, you can probably raise work_mem
to 8 MB safely if it's at the default of 1 MB. With a database that fits
in RAM, you can often turn random_page_cost down to near 1.0 (1.2 to 1.4
is common for such scenarios). And a larger effective_cache_size (in the
20 GB range) will hint to the planner that it's likely to find everything
it needs somewhere in RAM rather than on disk.
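If it helps, here's roughly what those changes look like in postgresql.conf
(a sketch only -- the numbers are the starting points suggested above, and
$PGDATA is assumed):

    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    work_mem = 8MB                # up from the 1MB default; per sort/hash allocation
    random_page_cost = 1.2        # DB fits in RAM, so random reads are nearly as cheap as sequential
    effective_cache_size = 20GB   # hint: most pages will be found in the OS cache
    EOF
    pg_ctl reload -D "$PGDATA"    # all three take effect on reload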
There are several common bottlenecks you can try to tune away from.
IO doesn't look like a problem for you, and neither does CPU load. So
we're left with context-switching overhead and memory-to-CPU bandwidth.
If your CPUs are basically becoming data pumps, then the speed of your
FSB becomes VERY critical, and some older Intel mobos didn't have a lot
of CPU-to-memory bandwidth; adding CPUs made it worse, not better. More
modern Intel chipsets have much faster CPU-to-memory bandwidth, since
they're using the same kind of fabric switching that AMD uses on highly
parallel machines.
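If you want to put rough numbers on those two suspects, something like the
following works (assuming the sysstat and sysbench packages are installed;
the sizes and counts are arbitrary):

    sar -w 1 10          # cswch/s column: context switches per second; sample this under load
    sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
                         # crude memory bandwidth figure (MB/sec near the end of the output)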
If your limit is your hardware, then the only solution is a faster
machine. It may well be that a machine with dual fast Nehalem
(2.4GHz+) quad core CPUs will be faster. Or 4 or 8 AMD CPUs with
their faster fabric.