Re: 60 core performance with 9.3

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: 60 core performance with 9.3
Date: 2014-07-30 01:44:54
Message-ID: 53D84E16.90609@catalyst.net.nz
Lists: pgsql-performance

On 17/07/14 11:58, Mark Kirkwood wrote:

>
> Trying out with numa_balancing=0 seemed to get essentially the same
> performance. Similarly wrapping postgres startup with --interleave.
>
> All this made me want to try with numa *really* disabled. So rebooted
> the box with "numa=off" appended to the kernel cmdline. Somewhat
> surprisingly (to me anyway), the numbers were essentially identical. The
> profile, however is quite different:
>

A little more tweaking got some further improvement:

rwlocks patch as before

wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

LSI RAID adaptor: read ahead and write cache disabled (SSD fast path mode)
numa_balancing = 0
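
(The GUC changes above just go in postgresql.conf followed by a restart; the
numa_balancing toggle is the kernel knob. A rough sketch of how these get
flipped - $PGDATA is a placeholder, paths will vary:)

  # kernel side (runtime toggle, set back to 1 to re-enable)
  sysctl -w kernel.numa_balancing=0

  # postgres side: edit postgresql.conf with the settings above, then
  pg_ctl -D $PGDATA restart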

Pgbench scale 2000 again:

clients | tps (prev) | tps (tweaked config)
---------+------------+---------
6 | 8175 | 8281
12 | 14409 | 15896
24 | 17191 | 19522
48 | 23122 | 29776
96 | 22308 | 32352
192 | 23109 | 28804
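
(For anyone wanting to reproduce: each row is one run of the standard pgbench
TPC-B-ish workload at scale 2000. A sketch of the sort of invocation - the
database name, thread count and 10 minute duration here are guesses rather
than exactly what we ran:)

  pgbench -i -s 2000 pgbench        # one-off initialisation
  for c in 6 12 24 48 96 192; do
      pgbench -c $c -j $c -T 600 pgbench
  done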

Now recall that we were seeing no actual tps change with numa_balancing set to
0 or 1 (so the improvement above comes from the other changes), but we figured
it might be informative to try to track down what the non-numa bottlenecks
looked like. We tried profiling the entire 10 minute run, which showed the
stats collector as a possible source of contention:

3.86% postgres [kernel.kallsyms] [k] _raw_spin_lock_bh
|
--- _raw_spin_lock_bh
|
|--95.78%-- lock_sock_nested
| udpv6_sendmsg
| inet_sendmsg
| sock_sendmsg
| SYSC_sendto
| sys_sendto
| tracesys
| __libc_send
| |
| |--99.17%-- pgstat_report_stat
| | PostgresMain
| | ServerLoop
| | PostmasterMain
| | main
| | __libc_start_main
| |
| |--0.77%-- pgstat_send_bgwriter
| | BackgroundWriterMain
| | AuxiliaryProcessMain
| | 0x7f08efe8d453
| | reaper
| | __restore_rt
| | PostmasterMain
| | main
| | __libc_start_main
| --0.07%-- [...]
|
|--2.54%-- __lock_sock
| |
| |--91.95%-- lock_sock_nested
| | udpv6_sendmsg
| | inet_sendmsg
| | sock_sendmsg
| | SYSC_sendto
| | sys_sendto
| | tracesys
| | __libc_send
| | |
| | |--99.73%-- pgstat_report_stat
| | | PostgresMain
| | | ServerLoop

Disabling track_counts and rerunning pgbench:

clients | tps (no counts)
---------+------------
6 | 9806
12 | 18000
24 | 29281
48 | 43703
96 | 54539
192 | 36114

While these numbers look great in the middle range (12-96 clients), the
benefit looks to be tailing off as the client count increases. Also, running
with no stats (and hence no autovacuum or autoanalyze) is way too scary!
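
(For completeness, that run was just the one GUC flipped plus a restart -
emphatically not something to leave set on a real system, since autovacuum
depends on these counters:)

  echo "track_counts = off" >> $PGDATA/postgresql.conf
  pg_ctl -D $PGDATA restart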

Trying out less write heavy workloads shows that the stats overhead does not
appear to be significant for *read* heavy cases, so the result above is
perhaps more of a curiosity than anything (read heavy being more typical, and
our real workload is closer to read heavy).

The profile for counts off looks like:

4.79% swapper [kernel.kallsyms] [k] read_hpet
|
--- read_hpet
|
|--97.10%-- ktime_get
| |
| |--35.24%-- clockevents_program_event
| | tick_program_event
| | |
| | |--56.59%-- __hrtimer_start_range_ns
| | | |
| | | |--78.12%-- hrtimer_start_range_ns
| | | | tick_nohz_restart
| | | | tick_nohz_idle_exit
| | | | cpu_startup_entry
| | | | |
| | | | |--98.84%-- start_secondary
| | | | |
| | | | --1.16%-- rest_init
| | | | start_kernel
| | | | x86_64_start_reservations
| | | | x86_64_start_kernel
| | | |
| | | --21.88%-- hrtimer_start
| | | tick_nohz_stop_sched_tick
| | | __tick_nohz_idle_enter
| | | |
| | | |--99.89%-- tick_nohz_idle_enter
| | | | cpu_startup_entry
| | | | |
| | | | |--98.30%-- start_secondary
| | | | |
| | | | --1.70%-- rest_init
| | | | start_kernel
| | | | x86_64_start_reservations
| | | | x86_64_start_kernel
| | | --0.11%-- [...]
| | |
| | |--40.25%-- hrtimer_force_reprogram
| | | __remove_hrtimer
| | | |
| | | |--89.68%-- __hrtimer_start_range_ns
| | | | hrtimer_start
| | | | tick_nohz_stop_sched_tick
| | | | __tick_nohz_idle_enter
| | | | |
| | | | |--99.90%-- tick_nohz_idle_enter
| | | | | cpu_startup_entry
| | | | | |
| | | | | |--99.04%-- start_secondary
| | | | | |
| | | | | --0.96%-- rest_init
| | | | | start_kernel
| | | | | x86_64_start_reservations
| | | | | x86_64_start_kernel
| | | | --0.10%-- [...]
| | | |
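
(Given read_hpet right at the top there, one obvious thing still on the list
is to check, and maybe switch, the clocksource - untried so far, sketch only:)

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource

  # e.g. switch to tsc if it is listed as available
  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource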

Any thoughts on how to proceed further appreciated!

Cheers,

Mark
