Re: Sudden drop in DBb performance

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Gerhard Wohlgenannt" <wohlg(at)ai(dot)wu(dot)ac(dot)at>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "Andy Colson" <andy(at)squeakycode(dot)net>, "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org, "Heinz-Peter Lang" <heinz(at)langatium(dot)net>, "Gerhard Wohlgenannt" <wohlg(at)ai(dot)wu-wien(dot)ac(dot)at>, "Weichselbraun, Albert" <albert(dot)weichselbraun(at)wu(dot)ac(dot)at>
Subject: Re: Sudden drop in DBb performance
Date: 2011-09-06 16:50:19
Message-ID: 754520d640fa910d62cf4d0b33b58ff9.squirrel@sq.gransy.com
Lists: pgsql-performance

On 6 September 2011, 10:55, Gerhard Wohlgenannt wrote:
>
>> That's why I love dstat, just do this
>>
>> $ dstat -C 0,1,2,3,4,5,6,7
>>
>> and you know all you need.
>
> dstat looks like a very nice tool, results below ..
> (now the system load seems a bit lower than before when generating
> results for vmstat and iostat)
>>> Good catch, thanks Scott.
>> Yes, good catch.
>>
>> Still, this does not explain why the queries were running fast before,
>> and
>> why the RAID array is so sluggish. Not to mention that we don't know
>> what
>> were the conditions when collecting those numbers (were the VMs off or
>> running?).
>>
> the VMs were running. they are in something like production use, so i
> shouldn't just turn them off .. :-)
> and the processes in the VMs cause a big portion of the DB load, so
> turning them off would distort the results ...

Distort the results? If you want to measure the RAID performance, you have
to do that when there are no other processes using it.

> and thanks again for all the replies!!! :-)

Please use something like pastebin.com to post these results. It was
bearable for the vmstat output, but this is almost unreadable due to the
wrapping.

> ~# dstat -C 0,1,2,3,4,5,6,7
>        usr sys idl wai hiq siq
> cpu0:    7   1  75  17   0   0
> cpu1:    4   5  84   7   0   0
> cpu2:    5   3  80  12   0   0
> cpu3:    4   3  85   9   0   0
> cpu4:    7   2  75  16   0   0
> cpu5:    4   2  87   8   0   0
> cpu6:    7   2  75  16   0   0
> cpu7:    4   1  87   8   0   0
> -dsk/total-: read 5071k, writ 2578k
> -net/total-: recv 0, send 0
> ---paging--: in 9760B, out 9431B
> ---system--: int 2468, csw 4126
...

But if I read that correctly, the I/O wait for the cores is 17%, 7%, 12%,
9%, 16%, 8%, 16% and 8%, and the cores are mostly idle (idl is around
75-87%). So it seems there's a small number of processes being switched
between the CPUs, and most of the time they're waiting for I/O.

Given the low values for disk I/O and the iostat output we've seen before,
it's obvious there's a lot of random I/O (mostly writes).

Let's speculate for a while about what could cause this (in no particular
order):

1) Checkpoints. Something is doing a lot of writes, and with a database
that often means a checkpoint is in progress. I'm not sure about your
checkpoint_timeout, but you do have 192 segments and about 7GB of shared
buffers. That means there may be a lot of dirty buffers (possibly even 100%
of the buffers).
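
Just to double-check the values mentioned above, you can read the current
settings straight from psql - a trivial sketch, nothing version-specific:

  SHOW checkpoint_segments;           -- WAL segments between checkpoints
  SHOW checkpoint_timeout;            -- time-based checkpoint trigger
  SHOW checkpoint_completion_target;  -- how much checkpoint writes are spread out
  SHOW shared_buffers;                -- upper bound on dirty buffers a checkpoint may flush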

You're using RAID5, and that really is not a write-friendly RAID level.
We don't know the actual performance, as bonnie was run with the VMs
accessing the volume, but RAID10 usually performs much better.

Enable log_checkpoints in the config and see what's going on. You can also
use iotop to see which processes are doing the writes (it might be the
background writer, ...).
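
For example (just a sketch - log_checkpoints needs a config reload, and
pg_stat_bgwriter is available on 8.3 and later):

  -- postgresql.conf
  log_checkpoints = on

  -- then, from psql
  SELECT pg_reload_conf();

  -- how often checkpoints happen and how many buffers they write
  SELECT checkpoints_timed, checkpoints_req,
         buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;

If checkpoints_req grows quickly or buffers_checkpoint is huge, checkpoints
are a likely culprit.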

2) The RAID is broken and can't handle the load it handled fine before.
This is not very likely, as you've mentioned that there were no warnings
etc.

3) There are some new jobs that do a lot of I/O. Is there anything new that
wasn't running before? I guess you'd have mentioned that.
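
If some of those jobs hit the database, they should show up in
pg_stat_activity - a quick sketch, assuming the column names used by
8.4/9.0 (procpid / current_query):

  SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
   WHERE current_query <> '<IDLE>'
   ORDER BY query_start;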

4) The database grew significantly in a short period of time, and the
active part no longer fits into RAM (the page cache), so the data has to be
retrieved from disk. And it's not just about the database size, it's about
the active part of the database - if you're suddenly accessing more data,
the cache may not be large enough.

This is usually a gradual process (the cache hit ratio slowly decreases as
the database grows), but if the database grew rapidly ... This could be
caused by MVCC, i.e. there may be a lot of dead tuples - have you done a
big UPDATE / DELETE or something like that recently?
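
A rough way to check both things from SQL (a sketch, assuming the stats
collector is on so pg_stat_database / pg_stat_user_tables are populated):

  -- database size and overall cache hit ratio
  SELECT datname,
         pg_size_pretty(pg_database_size(datname)) AS size,
         round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 1) AS hit_pct
    FROM pg_stat_database;

  -- tables with the most dead tuples (candidates for VACUUM)
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
   ORDER BY n_dead_tup DESC
   LIMIT 10;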

regards
Tomas
