From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Anne Rosset <arosset(at)collab(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: FW: Queries becoming slow under heavy load
Date: 2011-02-03 18:22:24
Message-ID: AANLkTikHuTAVCwnhjC5LmjTEf12Z76XPS2Q6SCO4LsbM@mail.gmail.com
Lists: pgsql-performance
Excellent! And you learned a bit more about how to monitor your
server while you were at it. Win win!
On Thu, Feb 3, 2011 at 10:40 AM, Anne Rosset <arosset(at)collab(dot)net> wrote:
> Thanks to all of you who replied and pointed NFS as a potential
> culprit.
> Our issue was that pgsql's temp dir (pgsql_tmp) was left at the default
> location ($PSQL_DIR/base/pgsql_tmp/), which was on NFS.
> Moving the temp dir to a local disk got us a huge improvement.
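>
> (For reference, a minimal sketch of one way to do that, assuming
> PostgreSQL 8.3 or later and a hypothetical local directory
> /local/pg_temp owned by the postgres user:)
>
> -- sketch only, not from the original thread: put temp files on a
> -- local-disk tablespace instead of relocating pgsql_tmp by hand
> CREATE TABLESPACE temp_local LOCATION '/local/pg_temp';
> -- then in postgresql.conf (or per database/role):
> --   temp_tablespaces = 'temp_local'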
>
> Anne
>
> -----Original Message-----
> From: Shaun Thomas [mailto:sthomas(at)peak6(dot)com]
> Sent: Friday, January 28, 2011 7:31 AM
> To: Anne Rosset
> Cc: pgsql-performance(at)postgresql(dot)org
> Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load
>
> On 01/27/2011 11:12 PM, Anne Rosset wrote:
>
>> Thanks for your response.
>> We are over NFS for our storage ...
>
> NFS? I'm not sure you know this, but NFS has major locking issues that
> would make it a terrible candidate for hosting a database.
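>
> (As a quick check, not from the original message, you can confirm where
> the data directory actually lives and how the NFS mount is set up; the
> data-directory path below is just an example:)
>
> # which filesystem (and type) holds the PostgreSQL data directory?
> df -T /var/lib/pgsql/data
> # list the NFS mounts and their options (sync/async, noac, lock settings)
> mount | grep nfs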
>
>> and it's not until around the 221-second mark that we catch it
>> consuming CPU:
>>
>>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  7090 root     25   0  689m 399m  10m R 93.4  5.0  3872:07  java
>> 28312 postgres 16   0  396m 225m 204m R  5.7  2.8  0:51.52  postmaster  <----- here
>>  3391 root     15   0 29056 2348 1424 R  1.9  0.0  0:00.01  top
>>  4297 root     16   0 10228  740  632 D  0.0  0.0 12:53.66  hald-addon-stor
>> 26885 httpd    15   0 2263m 1.5g  16m R  0.0 19.0  0:00.01  java
>>
>> Note that the load average is fine during this timeframe, ~4 out of 8,
>> so plenty of CPU.
>
> Please listen to us. We asked you to use sar or iostat to tell us how
> heavily the disk IO is being utilized. From your other screenshots,
> there were at least two other PG processes running that could have been
> thrashing the disk or locking tables your "slow" query needed. If it's
> waiting for disk IO, the CPU will remain low until it gets what it
> needs.
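>
> For example, something along these lines (sysstat tools; the interval
> and sample count are arbitrary):
>
> # extended per-device stats in MB/s, one report every 2 seconds
> iostat -dmx 2
> # or the block-device view from sar: 10 samples, 2 seconds apart
> sar -d 2 10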
>
> Not everything is about the CPU. Especially now that we know your DB is
> running on top of NFS.
>
>> Further, or worse yet, this same behavior expands out to multiple
>> processes, producing a true "back up". It can look something like
>> this. Notice the 0% cpu consumption:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 7090 root      22   0  689m 399m  10m R 91.1  5.0  3874:32  java
>> 4139 root      15   0 29080 2344 1424 R  1.9  0.0  0:00.01  top
>> 1555 postgres  16   0  474m 258m 162m D  0.0  3.2  0:17.32  postmaster
>> 1846 postgres  16   0  474m 285m 189m D  0.0  3.6  0:47.43  postmaster
>> 2713 postgres  16   0  404m 202m 179m D  0.0  2.5  0:33.54  postmaster
>> 2801 postgres  16   0  391m 146m 131m D  0.0  1.8  0:04.48  postmaster
>> 2804 postgres  16   0  419m 172m 133m D  0.0  2.2  0:09.41  postmaster
>> 2825 postgres  16   0  473m 142m  49m D  0.0  1.8  0:04.12  postmaster
>
> Yes. And they could all be waiting for IO. Or NFS locking is blocking
> the reads. Or... what is that Java app doing? We don't know the state of
> your IO, and when you have 0% or very low CPU usage, you either have
> locking contention or you're being IO starved.
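>
> (One way to tell the two apart, sketched here rather than quoted from
> the thread: look for ungranted locks.)
>
> -- backends currently waiting on a lock; an empty result points at IO
> -- starvation rather than lock contention
> SELECT pid, locktype, relation::regclass AS relation, mode
>   FROM pg_locks
>  WHERE NOT granted;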
>
> And what queries are these connections performing? You can check by
> looking at the contents of the pg_stat_activity system view (a sample
> query is sketched after the iostat output below). If they're selecting
> and still "slow", compare that against the iostat or sar results. For
> instance, here's an iostat of our system:
>
> iostat -dmx dm-9 1
>
> Linux 2.6.18-92.el5 (oslchi6pedb1) 01/28/2011
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00  125.46   227.78   4.95   0.89     33.88      0.08   0.19   0.08   2.91
> dm-9       0.00    0.00    5.00     0.00   0.04   0.00     14.40      0.05  10.60  10.60   5.30
> dm-9       0.00    0.00    2.00     0.00   0.02   0.00     16.00      0.01   7.00   7.00   1.40
> dm-9       0.00    0.00    4.00  1184.00   0.04   4.62      8.04     27.23  11.73   0.06   6.80
> dm-9       0.00    0.00   11.00   847.00   0.09   3.31      8.10     29.31  49.65   0.79  67.90
>
>
> That last column, %util, effectively tells us how saturated the
> controller is. If the percentage is high, it's working hard to supply
> the data we're asking for, or to finish writes. If it's low, we're
> probably working from memory cache, or getting fewer requests. There
> have been times our queries were "slow", and when we checked this stat
> it was often at or above 90%, sometimes for minutes at a time. That's
> almost always a clear indicator of IO contention. Queries can't work
> without the data they need to return your results.
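>
> As for the pg_stat_activity check mentioned above, a query along these
> lines does the job on the 8.x/9.0 releases current at the time (9.2 and
> later renamed procpid and current_query to pid and query):
>
> -- what each backend is running right now, oldest first; long-running
> -- entries are the ones to line up against the iostat/sar output
> SELECT procpid, usename, waiting, query_start, current_query
>   FROM pg_stat_activity
>  ORDER BY query_start;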
>
> Sending us more CPU charts isn't going to help us help you.
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
> 312-676-8870
> sthomas(at)peak6(dot)com
>
> ______________________________________________
>
> See http://www.peak6.com/email_disclaimer.php
> for terms and conditions related to this email
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
--
To understand recursion, one must first understand recursion.