Quick Links

Re: Server hitting 100% CPU usage, system comes to a crawl.

From:	Scott Mead <scottm(at)openscg(dot)com>
To:	Brian Fehrle <brianf(at)consistentstate(dot)com>
Cc:	pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Server hitting 100% CPU usage, system comes to a crawl.
Date:	2011-10-27 20:27:19
Message-ID:	CAKq0gvKw_dvXMYme+MG=Aj8ENoPFug+u_Jvx5nXgLig-G32JRw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf(at)consistentstate(dot)com>wrote:

> Hi all, need some help/clues on tracking down a performance issue.
>
> PostgreSQL version: 8.3.11
>
> I've got a system that has 32 cores and 128 gigs of ram. We have connection
> pooling set up, with about 100 - 200 persistent connections open to the
> database. Our applications then use these connections to query the database
> constantly, but when a connection isn't currently executing a query, it's
> <IDLE>. On average, at any given time, there are 3 - 6 connections that are
> actually executing a query, while the rest are <IDLE>.
>

Remember, when you read pg_stat_activity, it is showing you query activity
from that exact specific moment in time. Just because it looks like only
3-6 connections are executing, doesn't mean that 200 aren't actually
executing < .1ms statements. With such a beefy box, I would see if you can
examine any stats from your connection pooler to find out how many
connections are actually getting used.

>
> About once a day, queries that normally take just a few seconds slow way
> down, and start to pile up, to the point where instead of just having 3-6
> queries running at any given time, we get 100 - 200. The whole system comes
> to a crawl, and looking at top, the CPU usage is 99%.
>
> Looking at top, I see no SWAP usage, very little IOWait, and there are a
> large number of postmaster processes at 100% cpu usage (makes sense, at this
> point there are 150 or so queries currently executing on the database).
>
> Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
> Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
> 0.2%st
> Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
> Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached
>
>
> In the past, we noticed that autovacuum was hitting some large tables at
> the same time this happened, so we turned autovacuum off to see if that was
> the issue, and it still happened without any vacuums running.
>
That was my next question :)

>
> We also ruled out checkpoints being the cause.
>
> How exactly did you rule this out? Just because a checkpoint is over
doesn't mean that it hasn't had a negative effect on the OS cache. If
you're stuck going to disk, that could be hurting you (that being said, you
do point to a low I/O wait above, so you're probably correct in ruling this
out).

>
> I'm currently digging through some statistics I've been gathering to see if
> traffic increased at all, or remained the same when the slowdown occurred.
> I'm also digging through the logs from the postgresql cluster (I increased
> verbosity yesterday), looking for any clues. Any suggestions or clues on
> where to look for this to see what can be causing a slowdown like this would
> be greatly appreciated.
>
> Are you capturing table-level stats from pg_stat_user_[tables | indexes]?
Just because a server doesn't look busy doesn't mean that you're not doing
1000 index scans per second returning 1000 tuples each time.

--Scott

> Thanks,
> - Brian F
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/**mailpref/pgsql-general<http://www.postgresql.org/mailpref/pgsql-general>
>

In response to

Server hitting 100% CPU usage, system comes to a crawl. at 2011-10-27 18:39:30 from Brian Fehrle

Responses

Re: Server hitting 100% CPU usage, system comes to a crawl. at 2011-10-27 20:36:50 from Brian Fehrle

Browse pgsql-general by date

	From	Date	Subject
Next Message	Adrian Schreyer	2011-10-27 20:31:41	Custom data type in C with one fixed and one variable attribute
Previous Message	Brian Fehrle	2011-10-27 20:15:25	Re: Server hitting 100% CPU usage, system comes to a crawl.