From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Jerry Sievers <gsievers19(at)comcast(dot)net> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Up to date conventional wisdom re max shared_buffer size? |
Date: | 2017-09-20 19:13:26 |
Message-ID: | 20170920191326.2yuzf6h4zvnhlmi6@alap3.anarazel.de |
Lists: | pgsql-general |
On 2017-09-20 13:00:34 -0500, Jerry Sievers wrote:
> >> Pg 9.3 on monster 2T/192 CPU Xenial thrashing
> >
> > Not sure what the word "thrashing" in that sentence means.
>
> Cases of dozens or hundreds of sessions running typical statements for
> this system but running 100% on their CPUs. Seems to be triggered by
> certain heavy weight batch jobs kicking off on this generally OLTP
> system.
>
> ISTM there might be LW lock contention happening around some sort of
> shared resource where the lock wait implementation is a CPU spinner.
Yes, we improved that a lot in 9.5, 9.6 and 10. The really bad
scenarios - I've seen 95% of CPU time spent in locking - should all be
fixed.
I'd try to make sure that both transparent hugepages and zone reclaim
mode are disabled - the latter probably is already, but the former might
still cause some problems.
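For anyone wanting to verify those two settings, here is a minimal sketch. It assumes a Linux box exposing the usual `/sys/kernel/mm/transparent_hugepage/enabled` and `/proc/sys/vm/zone_reclaim_mode` files; the helper names are made up for illustration and are not part of any tool.

```python
from pathlib import Path

# Standard Linux locations; some distro kernels may expose THP elsewhere.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def parse_thp_mode(text):
    """Return the active THP mode, i.e. the bracketed word in
    a line such as 'always madvise [never]'. None if nothing is bracketed."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def thp_disabled(text):
    """True only when the active THP mode is 'never'."""
    return parse_thp_mode(text) == "never"


def zone_reclaim_disabled(text):
    """True when vm.zone_reclaim_mode is 0."""
    return int(text.strip()) == 0


def read_or_none(path):
    """Read a small kernel file, returning None if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    thp = read_or_none(THP_PATH)
    zrm = read_or_none(ZONE_RECLAIM_PATH)
    if thp is not None:
        print("THP mode:", parse_thp_mode(thp), "(disabled)" if thp_disabled(thp) else "(NOT disabled)")
    if zrm is not None:
        print("zone_reclaim_mode:", zrm.strip(), "(disabled)" if zone_reclaim_disabled(zrm) else "(NOT disabled)")
```

This only reports the current state; changing either setting needs root and is not attempted here.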
> > Things have improved a lot since 9.3 WRT scalability, so I'd not
> > infer too much from 9.3 performance on a larger box.
>
> Understood. The situation got worse when we moved to the even bigger
> box also running a 4.x kernel which I presume was nowhere near existent
> when 9.3 was our current Pg version.
I suspect it's more the bigger box than the newer kernel. The more
sockets and cores you have, the more lock contention bites you. That's
because inter-socket / CPU transfers get more expensive with more cores.
> >> Upgrade pending but we recently started having $interesting performance
> >> issues at times looking like I/O slowness and other times apparently
> >> causing CPU spins.
> >
> > That's not something we can really usefully comment on given the amount
> > of information.
>
> Ack'd.
>
> I'd like to strace some of the spinning backends when/if we get another
> opportunity to observe the problem to see if by syscall or libfunc name
> we can learn more about what's the cause.
I think the causes are known, and fixed - don't think there's much you
can do besides upgrading, unless you want to backport a number of
complex patches yourself.
FWIW, usually perf gives better answers than strace in this type of
scenario.
> >> Anyway, shared_buffer coherency generally high but does take big dips
> >> that are sometimes sustained for seconds or even minutes.
> >
> > "shared_buffer coherency"?
>
> As measured by querying pg_stat_database and comparing total reads to read
> hits. Run frequently, such as once every 5 seconds, and factored into a hit
> percentage. May stay up around 100% for several ticks but then go way
> down, which may or may not be sustained.
>
> This is an OLTP app using Rails with hundreds of tables both trivial
> in structure as well as having partitions, large payloads... TOAST and
> the like.
>
> TPS can measure in the ~5-10k range.
That's cache hit rate, not coherency ;)
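To make the measurement concrete, here is a rough sketch of the per-interval hit rate described above. It assumes two successive samples of the cumulative `blks_read` and `blks_hit` counters from `pg_stat_database`; the function name, the query string, and the tuple layout are illustrative only.

```python
# Query one might run every few seconds; both columns are cumulative counters.
SAMPLE_QUERY = "SELECT sum(blks_read), sum(blks_hit) FROM pg_stat_database"


def hit_rate(prev, curr):
    """Percentage of block requests served from shared_buffers between
    two samples. prev and curr are (blks_read, blks_hit) tuples.
    Returns None when there was no activity or the counters were reset."""
    read_delta = curr[0] - prev[0]
    hit_delta = curr[1] - prev[1]
    if read_delta < 0 or hit_delta < 0:
        return None  # statistics were reset between the samples
    total = read_delta + hit_delta
    if total == 0:
        return None  # no block requests in this interval
    return 100.0 * hit_delta / total


if __name__ == "__main__":
    # Made-up sample values: 100 reads and 900 hits in the interval.
    print(hit_rate((100, 900), (200, 1800)))
```

Note that `blks_read` counts blocks not found in shared_buffers; such reads may still be served from the OS page cache, so a dip in this number does not necessarily mean disk I/O.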
- Andres