From: Scottix <scottix(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Michael Lewis <mlewis(at)entrata(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing Database High CPU
Date: 2019-05-10 20:26:08
Message-ID: CANKFHZ-KVt08zRkbZnLJHprMnLLgVdLejHybECvLWCDormO1Xg@mail.gmail.com
Lists: pgsql-general
Hey,
So I finally found the culprit. It turns out to be THP (transparent huge
pages) fighting with itself.
After running the following on Ubuntu:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
the load average instantly went from 30 to 3.
Also make sure these settings are re-applied after a reboot, since writes
under /sys don't persist.
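One low-tech way to re-apply them at boot is /etc/rc.local (a minimal
sketch; assumes your distro still runs rc.local at boot and that the sysfs
paths above exist on your kernel -- a systemd oneshot unit works just as
well):

#!/bin/sh
# Re-apply the THP settings at boot; writes under /sys are lost on reboot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
exit 0

Don't forget to make it executable (chmod +x /etc/rc.local).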
Anyway, just wanted to give a follow-up on the issue in case anyone else is
having the same problem.
On Mon, Mar 4, 2019 at 12:03 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Wed, Feb 27, 2019 at 5:01 PM Michael Lewis <mlewis(at)entrata(dot)com> wrote:
>
>>> If those 50-100 connections are all active at once, yes, that is high.
>>> They can easily spend more time fighting each other over LWLocks,
>>> spinlocks, or cachelines rather than doing useful work. This can be
>>> exacerbated when you have multiple sockets rather than all cores in a
>>> single socket. And these problems are likely to present as high Sys times.
>>>
>>> Perhaps you can put up a connection pooler which will allow 100
>>> connections to all think they are connected at once, but forces only 12 or
>>> so to actually be active at one time, making the others transparently queue.
>>>
>>
>> Can you expound on this or refer me to someplace to read up on this?
>>
>
> Just based on my own experimentation. This is not a blanket
> recommendation, but specific to the situation where we already suspect
> there is contention and the server is too old to have the
> pg_stat_activity.wait_event column.
>
>
>> Context (I don't want to thread-jack, though): I think I am seeing
>> similar behavior in our environment at times, with queries that normally
>> take seconds taking 5+ minutes under high load. I see many queries
>> showing buffer_mapping as the LWLock type in snapshots, but don't know
>> whether that is expected.
>>
>
> It sounds like your processes are fighting to reserve buffers in
> shared_buffers in which to read data pages. But those data pages are
> probably already in the OS page cache; otherwise reading them from disk
> would be slow enough that you would be seeing some type of IO wait, or
> buffer_io, rather than buffer_mapping, as the dominant wait type. So I
> think that means you have most of your data in RAM, but not enough of it
> in shared_buffers. You might be in a rare situation where setting
> shared_buffers to a high fraction of RAM, rather than the usual low
> fraction, is called for. Increasing NUM_BUFFER_PARTITIONS might also be
> useful, but that requires a recompilation of the server. But do these
> spikes correlate with anything known at the application level? A change
> in the mix of queries, or a long report or maintenance operation? Maybe
> the query plans briefly toggle over to using seq scans rather than index
> scans, or vice versa, which drastically changes the block access patterns?
>
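For what it's worth, on 9.6 or later a quick way to see whether
buffer_mapping or buffer_io dominates is to sample pg_stat_activity during
a spike (a sketch; adjust the connection options to your setup):

psql -c "SELECT wait_event_type, wait_event, count(*)
           FROM pg_stat_activity
          WHERE state = 'active'
          GROUP BY 1, 2
          ORDER BY 3 DESC;"

Running it a few times under load and again during a quiet period makes the
contention pattern fairly obvious.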
>
>> In our environment, PgBouncer will accept several hundred connections
>> and allow up to 100 at a time to be active on the database, which runs
>> on VMs with ~16 CPUs allocated (some more, some less, multi-tenant and
>> manually sharded). It sounds like you are advocating for a connection
>> max very close to the number of cores. I'd like to better understand
>> the pros and cons of that decision.
>>
>
> There are good reasons to allow more than that. For example, your
> application holds some transactions open briefly while it does some
> cogitation on the application side, rather than immediately committing
> and so returning the connection to the connection pool. Or your server
> has a very high IO capacity and benefits from lots of read requests in
> the queue at the same time, so it can keep every spindle busy and every
> rotation productive. But if you have no reason to believe that any of
> those situations apply to you, and do have evidence that you have lock
> contention between processes, then I think that limiting the number of
> active processes to the number of cores is a good starting point.
>
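For anyone wanting to try this, the relevant PgBouncer knobs live in
pgbouncer.ini (a sketch with illustrative values, not a recommendation):

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 200
default_pool_size = 12

max_client_conn is how many clients may think they are connected at once;
default_pool_size (per database/user pair) caps how many are actually
active on the server, and the rest queue transparently.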
> Cheers,
>
> Jeff
>
--
T: @Thaumion
IG: Thaumion
Scottix(at)Gmail(dot)com