From: | pinker <pinker(at)onet(dot)eu> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: core system is getting unresponsive because over 300 cpu load |
Date: | 2017-10-10 22:28:52 |
Message-ID: | 1507674532671-0.post@n3.nabble.com |
Lists: | pgsql-general |
Tomas Vondra-4 wrote
> What is "CPU load"? Perhaps you mean "load average"?
Yes, I wasn't precise: I mean system CPU usage. It can be seen here, in the
graph from yesterday's failure (after 6 p.m.):
<http://www.postgresql-archive.org/file/t342733/cpu.png>
So, as one can see, connection spikes follow CPU spikes...
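
To make the distinction between load average and system CPU usage concrete: load average counts tasks that are runnable (on Linux, also those in uninterruptible sleep), while system CPU usage is the share of CPU time spent in kernel mode. Below is a minimal sketch, assuming a Linux host, of how that share could be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The function names and the sample values are made up for illustration and are not taken from the affected server.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def system_share(before, after):
    """Fraction of elapsed CPU ticks spent in kernel (system) mode between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [new - old for old, new in zip(first, second)]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
    # Guest time is already included in user/nice, so only the first eight fields are summed.
    total = sum(deltas[:8])
    return deltas[2] / total if total else 0.0


# Hypothetical samples, for illustration only.
sample_before = "cpu  1000 0 500 8000 100 0 50 0 0 0"
sample_after = "cpu  1200 0 1300 8400 100 0 100 0 0 0"
print(round(system_share(sample_before, sample_after), 3))
```

On a live machine the two strings would come from reading the first line of `/proc/stat` a few seconds apart.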
Tomas Vondra-4 wrote
> Also, what are the basic system parameters (number of cores, RAM), it's
> difficult to help without knowing that.
I actually wrote all of this in the first post:
80 CPUs and 4 sockets
over 500GB RAM
Tomas Vondra-4 wrote
> Well, 3M transactions over ~2h period is just ~450tps, so nothing
> extreme. Not sure how large the transactions are, of course.
There is quite a lot going on. Most of the transactions are complicated stored procedures.
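
The throughput estimate quoted above is simple arithmetic. As a rough sketch, taking exactly 3,000,000 transactions and exactly 2 hours (both figures are only approximate in the thread, and the function name is made up for illustration):

```python
def transactions_per_second(transactions, hours):
    """Average transaction rate over a period given in hours."""
    return transactions / (hours * 3600)


print(round(transactions_per_second(3_000_000, 2)))  # 417
```

That is in the same ballpark as the ~450 tps quoted. An average like this says nothing about how heavy each transaction is, which is the point of the reply.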
Tomas Vondra-4 wrote
> Something gets executed on the database. We have no idea what it is, but
> it should be in the system logs. And you should see the process in 'top'
> with large amounts of virtual memory ...
Yes, it would be much easier if it were just a single query showing up in
top, but most of the CPU is consumed by the system itself and I'm not sure
why. I suspect it is NUMA related, given the size of the page tables and the
amount of anon pages.
Tomas Vondra-4 wrote
> Another possibility is a run-away query that consumes a lot of work_mem.
That was exactly my first guess. work_mem is set to ~350MB and I see a lot of
stored procedures with unnecessary WITH clauses (i.e. materialization),
immediately followed by an IN query over their results (i.e. a hash).
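
For a sense of scale: work_mem is a per-operation limit, so a single query with several sort or hash steps can use it more than once. The sketch below is a back-of-the-envelope calculation using only the two approximate figures from this thread (about 500GB RAM, about 350MB work_mem). It deliberately ignores shared_buffers, the OS, and everything else, and the function name is made up for illustration.

```python
def full_size_allocations(ram_gb, work_mem_mb):
    """How many work_mem-sized allocations would fit in RAM, ignoring all other memory use."""
    return (ram_gb * 1024) // work_mem_mb


print(full_size_allocations(500, 350))  # 1462
```

That is an upper bound on the number of concurrent full-size sort or hash operations before memory runs out, not a measurement from the affected system.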
Tomas Vondra-4 wrote
> Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
> and then you can decide.
Thank you for the tip. I usually check it but hadn't here. The result is
0.992969610990056, so increasing it is rather pointless.
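
For reference, a minimal sketch of how such a ratio is usually derived from the two counters named above. The exact query used here is not shown in the thread, so the SQL in the comment is an assumption, and the counter values below are hypothetical.

```python
# One common way to fetch the inputs (an assumption, not the poster's actual query):
#   SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database;

def cache_hit_ratio(blks_hit, blks_read):
    """Share of block requests served from shared buffers rather than read in."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0


# Hypothetical counter values, for illustration only.
print(cache_hit_ratio(blks_hit=993, blks_read=7))
```

Note that blks_read also counts blocks served from the operating system's page cache, so the ratio reflects shared buffers only.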
Tomas Vondra-4 wrote
> You may also make the bgwriter more aggressive - that won't really
> improve the hit ratio, it will only make enough room for the backends.
Yes, I probably will.
Tomas Vondra-4 wrote
> But I don't quite see how this could cause the severe problems you have,
> as I assume this is kinda regular behavior on that system. Hard to say
> without more data.
I can provide you with any data you need :)
regards