Re: core system is getting unresponsive because over 300 cpu load

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pinker <pinker(at)onet(dot)eu>, pgsql-general(at)postgresql(dot)org
Subject: Re: core system is getting unresponsive because over 300 cpu load
Date: 2017-10-10 23:23:26
Message-ID: 85580fec-9d9d-a551-156a-7bb1a66615ed@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/11/2017 12:28 AM, pinker wrote:
> Tomas Vondra-4 wrote
>> What is "CPU load"? Perhaps you mean "load average"?
>
> Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the
> graph from yesterday's failure (after 6p.m.):
> <http://www.postgresql-archive.org/file/t342733/cpu.png>
> So as one can see connections spikes follow cpu spikes...
>

I'm probably a bit dumb (after all, it's 1AM over here), but can you
explain the CPU chart? I'd understand percentages (say, 75% CPU used)
but what do the seconds / fractions mean? E.g. when the system time
reaches 5 seconds, what does that mean?

>
> Tomas Vondra-4 wrote
>> Also, what are the basic system parameters (number of cores, RAM), it's
>> difficult to help without knowing that.
>
> I have actually written everything in the first post:
> 80 CPU and 4 sockets
> over 500GB RAM
>

Apologies, missed that bit.

>
> Tomas Vondra-4 wrote
>> Well, 3M transactions over ~2h period is just ~450tps, so nothing
>> extreme. Not sure how large the transactions are, of course.
>
> It's quite a lot going on. Most of them are complicated stored procedures.
>

OK.

>
> Tomas Vondra-4 wrote
>> Something gets executed on the database. We have no idea what it is, but
>> it should be in the system logs. And you should see the process in 'top'
>> with large amounts of virtual memory ...
>
> Yes, it would be much easier if it would be just single query from the top,
> but the most cpu is eaten by the system itself and I'm not sure why. I
> suppose because of page tables size and anon pages is NUMA related.
>

Have you tried profiling using perf? That usually identifies hot spots
pretty quickly - either in PostgreSQL code or in the kernel.

>
> Tomas Vondra-4 wrote
>> Another possibility is a run-away query that consumes a lot of work_mem.
>
> It was exactly my first guess. work_mem is set to ~ 350MB and I see a lot of
> stored procedures with unnecessary WITH clauses (i.e. materialization) and
> right after it IN query with results of that (hash).
>

Depends on how much data is in the CTEs. We don't really allocate all of
work_mem at once, but bit by bit.

>
> Tomas Vondra-4 wrote
>> Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
>> and then you can decide.
>
> Thank you for the tip. I always do it but haven't here, so the result is
> 0.992969610990056 - so increasing it is rather pointless.
>

Yeah.

>
> Tomas Vondra-4 wrote
>> You may also make the bgwriter more aggressive - that won't really
>> improve the hit ratio, it will only make enough room for the backends.
>
> yes i probably will
>

On the other hand, the numbers are rather low. I mean, the backends
seems to be evicting ~15k buffers over 5-minute period, which is pretty
much nothing (~400kB/s). I wouldn't bother by tuning this.

>
> Tomas Vondra-4 wrote
>> But I don't quite see how this could cause the severe problems you have,
>> as I assume this is kinda regular behavior on that system. Hard to say
>> without more data.
>
> I can provide you with any data you need :)
>

What I meant is that if the system evicts this amount of buffers all the
time (i.e. there doesn't seem to be any sudden spike), then it's
unlikely to be the cause (or related to it).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andres Freund 2017-10-11 00:18:06 Re: core system is getting unresponsive because over 300 cpu load
Previous Message pinker 2017-10-10 23:00:11 Re: core system is getting unresponsive because over 300 cpu load