Re: Up to date conventional wisdom re max shared_buffer size?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jerry Sievers <gsievers19(at)comcast(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Up to date conventional wisdom re max shared_buffer size?
Date: 2017-09-20 19:13:26
Message-ID: 20170920191326.2yuzf6h4zvnhlmi6@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2017-09-20 13:00:34 -0500, Jerry Sievers wrote:
> >> Pg 9.3 on monster 2T/192 CPU Xenial thrashing
> >
> > Not sure what the word "thrashing" in that sentence means.
>
> Cases of dozens or hundreds of sessions running typical statements for
> this system but running 100% on their CPUs. Seems to be triggered by
> certain heavy weight batch jobs kicking off on this generally OLTP
> system.
>
> ISTM there might be LW lock contention happening around some sort of
> shared resource where the lock wait implementation is a CPU spinner.

Yes, we improved that a lot in 9.5, 9.6 and 10. The really bad
scenarios - I've seen 95% cpu time spent in locking - should all be
fixed.

I'd try to make sure that both transparent hugepages and zone reclaim
mode are disabled - the latter probably is already, but the former might
still cause some problems.

> > Things have improved a lot since 9.3 WRT to scalability, so I'd not
> > infer too much from 9.3 performance on a larger box.
>
> Understood. The situation got worse when we moved to the even bigger
> box also running a 4.x kernel which I presume was no where near existent
> when 9.3 was our current Pg version.

I suspect it's more the bigger box than the newer kernel. The more
sockets and cores you have, the more lock contention bites you. That's
because inter-socket / cpu transfers get more expensive with more cores.

> >> Upgrade pending but we recently started having $interesting performance
> >> issues at times looking like I/O slowness and other times apparently
> >> causing CPU spins.
> >
> > That's not something we can really usefully comment on given the amount
> > of information.
>
> Ack'd.
>
> I'd like to strace some of the spinning backends when/if we get another
> opportunity to observe the problem to see if by syscall or libfunc name
> we can learn more about what's the cause.

I think the causes are known, and fixed - don't think there's much you
can do besides upgrading, unless you want to backport a number of
complex patches yourself.

FWIW, usually perf gives better answers than strace in this type of
scenario.

> >> Anyway, shared_buffer coherency generally high but does take big dips
> >> that are sometimes sustained for seconds or even minutes.
> >
> > "shared_buffer coherency"?
>
> As measured querying pg_stat_databases and comparing total reads to read
> hits. Run frequently such as once /5-seconds and factored into a hit
> percentage. May stay up around 100% for several ticks but then go way
> down which may or not sustain.
>
> This is an OLTP app using Rails with hundreds of tables both trivial
> n structure as well as having partitions, large payloads... TOAST and
> the like.
>
> TPS can measure in the ~5-10k range.

That's cache hit rate, not coherency ;)

- Andres

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Scott Marlowe 2017-09-20 19:21:11 Re: Any known issues Pg 9.3 on Ubuntu Xenial kernel 4.4.0?
Previous Message Ken Tanzer 2017-09-20 18:47:00 Puzzled by UNION with unknown types