
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: pgbench vs. wait events
Date: 2016-10-06 18:38:56
Message-ID: CA+Tgmoav9Q5v5ZGT3+wP_1tQjT6TGYXrwrDcTRrWimC+ZY7RRA@mail.gmail.com

Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
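
For concreteness, each run was roughly of this shape; this is only a
sketch, since the exact pgbench invocation isn't shown above, and the
database name and -j value here are illustrative (--unlogged-tables is
presumably how the unlogged-table runs further down were initialized):

pgbench -i -s 3000 bench             # initialize at scale factor 3000
                                     # (-s 300 for the smaller runs; add
                                     #  --unlogged-tables for the unlogged ones)
pgbench -c 32 -j 32 -T 1800 bench    # 32 clients, 30-minute run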

Then, I used a little shell scripting to count up the number of times
each wait event occurred in the output (a sketch of that pipeline
follows the first set of results below). First, I tried scale factor
3000 with 32 clients and got these results:

1 LWLockTranche | buffer_mapping
9 LWLockNamed | CLogControlLock
14 LWLockNamed | ProcArrayLock
16 Lock | tuple
25 LWLockNamed | CheckpointerCommLock
49 LWLockNamed | WALBufMappingLock
122 LWLockTranche | clog
182 Lock | transactionid
287 LWLockNamed | XidGenLock
1300 Client | ClientRead
1375 LWLockTranche | buffer_content
3990 Lock | extend
21014 LWLockNamed | WALWriteLock
28497 |
58279 LWLockTranche | wal_insert

tps = 1150.803133 (including connections establishing)
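
The counting step was a pipeline along these lines; a sketch only (the
file name is illustrative, and the exact script may have differed),
assuming the \watch output was captured to a file:

grep -v '^$' wait_events.log | sort | uniq -c | sort -n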

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload. Relation extension locks and
buffer_content locks are also pretty common, both ahead of ClientRead,
a relatively uncommon wait event on this test. The load average on the
system was only about 3 during this test, indicating that most
processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

1 BufferPin | BufferPin
1 LWLockTranche | lock_manager
2 LWLockTranche | buffer_mapping
8 LWLockNamed | ProcArrayLock
9 LWLockNamed | CheckpointerCommLock
9 LWLockNamed | CLogControlLock
11 LWLockTranche | buffer_content
37 LWLockTranche | clog
153 Lock | tuple
388 LWLockNamed | XidGenLock
827 Lock | transactionid
1267 Client | ClientRead
20631 Lock | extend
91767 |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement. The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture. However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O. This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
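
As a crude cross-check in the meantime, the kernel process states can
be sampled directly from the shell; a sketch, assuming Linux and that
the backends show up under the process name "postgres":

for i in $(seq 1 600); do ps -C postgres -o state=; sleep 0.5; done | sort | uniq -c

A large "D" bucket relative to "R" and "S" tells the same story as the
load average.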

Next, I tried lowering the scale factor to something that fits in
shared buffers. Here are the results at scale factor 300:

14 Lock | tuple
22 LWLockTranche | lock_manager
39 LWLockNamed | WALBufMappingLock
331 LWLockNamed | CLogControlLock
461 LWLockNamed | ProcArrayLock
536 Lock | transactionid
552 Lock | extend
716 LWLockTranche | buffer_content
763 LWLockNamed | XidGenLock
2113 LWLockNamed | WALWriteLock
6190 LWLockTranche | wal_insert
25002 Client | ClientRead
78466 |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples. It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
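
One way to start teasing those two apart is to watch the WAL volume
itself while the run is in progress; a sketch (requires sysstat, and
which device holds pg_xlog is system-specific):

iostat -x 5    # if %util on the WAL device sits near 100 while WALWriteLock
               # waits pile up, the disk itself is the limit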

However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables. At scale factor 300, 32 clients, unlogged tables, I
get these results:

6 LWLockTranche | wal_insert
12 Lock | tuple
26 LWLockTranche | lock_manager
39 LWLockTranche | buffer_content
353 LWLockNamed | CLogControlLock
377 Lock | extend
491 Lock | transactionid
500 LWLockNamed | ProcArrayLock
696 LWLockNamed | XidGenLock
27685 Client | ClientRead
84930 |

WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.

64 clients:

1 BufferPin | BufferPin
2 LWLockTranche | buffer_mapping
51 LWLockTranche | wal_insert
52 Lock | relation
132 LWLockTranche | lock_manager
154 LWLockTranche | buffer_content
209 Lock | tuple
940 Lock | extend
1412 LWLockNamed | XidGenLock
2337 LWLockNamed | CLogControlLock
2813 LWLockNamed | ProcArrayLock
3859 Lock | transactionid
83760 Client | ClientRead
133840 |

tps = 36672.484524 (including connections establishing)

96 clients:

1 LWLockTranche | proc
53 LWLockTranche | wal_insert
305 LWLockTranche | buffer_content
391 LWLockTranche | lock_manager
797 Lock | tuple
1074 Lock | extend
2492 LWLockNamed | XidGenLock
4185 LWLockNamed | ProcArrayLock
6003 LWLockNamed | CLogControlLock
11348 Lock | transactionid
142881 Client | ClientRead
174011 |

tps = 37214.257661 (including connections establishing)

192 clients:

1 LWLockTranche | buffer_mapping
2 BufferPin | BufferPin
4 LWLockTranche | clog
8 LWLockTranche | proc
87 LWLockTranche | wal_insert
708 LWLockTranche | buffer_content
1892 LWLockTranche | lock_manager
2225 Lock | extend
4796 LWLockNamed | ProcArrayLock
5105 LWLockNamed | XidGenLock
7982 Lock | tuple
11882 LWLockNamed | CLogControlLock
64091 Lock | transactionid
275481 |
308627 Client | ClientRead

tps = 35018.906863 (including connections establishing)

There are a few interesting things here. First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others. As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant. Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time. It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.
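
For the record, that 1.7% is just the CLogControlLock count divided by
the total number of samples for the 192-client run, not-waiting samples
included; as a quick check:

awk 'BEGIN { total = 1+2+4+8+87+708+1892+2225+4796+5105+7982+11882+64091+275481+308627;
             printf "%.1f%% (11882 of %d samples)\n", 100*11882/total, total }'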

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time. Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
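
To put a rough number on the birthday-paradox effect: in the standard
TPC-B-like script, each transaction updates one pgbench_branches row
chosen at random out of scale-factor-many (300 here), so the chance
that at least two of N in-flight transactions pick the same row grows
quickly with N. A sketch, treating all N clients as simultaneously in
flight (which overstates things somewhat):

for n in 32 64 96 192; do
  awk -v n="$n" 'BEGIN { p = 1; for (i = 0; i < n; i++) p *= (300 - i) / 300;
                         printf "%3d clients: P(two pick the same branch row) ~ %.3f\n", n, 1 - p }'
done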

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
