Re: Speed up Clog Access by increasing CLOG buffers

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-15 07:43:03
Message-ID: CAA4eK1J9VxJUnpOiQDf0O=Z87QUMbw=uGcQr4EaGbHSCibx9yA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 13, 2016 at 7:53 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/12/2016 08:55 PM, Robert Haas wrote:
>> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>> I think at higher client counts, from 96 clients onwards, contention
>>> on CLogControlLock is clearly visible, and it is completely solved
>>> with the group lock patch.
>>>
>>> At lower client counts (32 and 64), contention on CLogControlLock is
>>> not significant, hence we do not see any gain with the group lock
>>> patch (though some reduction in CLogControlLock contention is visible
>>> at 64 clients).
>>
>> I agree with these conclusions. I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86. In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> 2. Repeat this test with a mixed read-write workload, like -b
>> tpcb-like@1 -b select-only@9
>>
>
> FWIW, I'm already running similar benchmarks on an x86 machine with 72
> cores (144 with HT). It's "just" a 4-socket system, but the results I
> got so far seem quite interesting. The tooling and results (pushed
> incrementally) are available here:
>
> https://bitbucket.org/tvondra/hp05-results/overview
>
> The tooling is completely automated, and it also collects various stats,
> like for example the wait events. So perhaps we could simply run it on
> cthulhu and get comparable results, and also more thorough data sets than
> just snippets posted to the list?
>
> There's also a bunch of reports for the 5 already completed runs
>
> - dilip-300-logged-sync
> - dilip-300-unlogged-sync
> - pgbench-300-logged-sync-skip
> - pgbench-300-unlogged-sync-noskip
> - pgbench-300-unlogged-sync-skip
>
> The name identifies the workload type, scale, and whether the tables are
> WAL-logged (for pgbench, "skip" means "-N" while "noskip" means regular
> pgbench).
>
> For example, the "reports/wait-events-count-patches.txt" compares the
> wait event stats with different patches applied (and master):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default
>
> and average tps (from 3 runs, 5 minutes each):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default
>
> There are certainly interesting bits. For example, while the "logged"
> case is dominated by WALWriteLock for most client counts, for large
> client counts that's no longer true.
>
> Consider for example dilip-300-logged-sync results with 216 clients:
>
> wait_event | master | gran_lock | no_cont_lock | group_upd
> --------------------+---------+-----------+--------------+-----------
> CLogControlLock | 624566 | 474261 | 458599 | 225338
> WALWriteLock | 431106 | 623142 | 619596 | 699224
> | 331542 | 358220 | 371393 | 537076
> buffer_content | 261308 | 134764 | 138664 | 102057
> ClientRead | 59826 | 100883 | 103609 | 118379
> transactionid | 26966 | 23155 | 23815 | 31700
> ProcArrayLock | 3967 | 3852 | 4070 | 4576
> wal_insert | 3948 | 10430 | 9513 | 12079
> clog | 1710 | 4006 | 2443 | 925
> XidGenLock | 1689 | 3785 | 4229 | 3539
> tuple | 965 | 617 | 655 | 840
> lock_manager | 300 | 571 | 619 | 802
> WALBufMappingLock | 168 | 140 | 158 | 147
> SubtransControlLock | 60 | 115 | 124 | 105
>
> Clearly, CLOG is an issue here, and it's (slightly) improved by all the
> patches (group_update performing the best). And with 288 clients (which
> is 2x the number of virtual cores in the machine, so not entirely crazy)
> you get this:
>
> wait_event | master | gran_lock | no_cont_lock | group_upd
> --------------------+---------+-----------+--------------+-----------
> CLogControlLock | 901670 | 736822 | 728823 | 398111
> buffer_content | 492637 | 318129 | 319251 | 270416
> WALWriteLock | 414371 | 593804 | 589809 | 656613
> | 380344 | 452936 | 470178 | 745790
> ClientRead | 60261 | 111367 | 111391 | 126151
> transactionid | 43627 | 34585 | 35464 | 48679
> wal_insert | 5423 | 29323 | 25898 | 30191
> ProcArrayLock | 4379 | 3918 | 4006 | 4582
> clog | 2952 | 9135 | 5304 | 2514
> XidGenLock | 2182 | 9488 | 8894 | 8595
> tuple | 2176 | 1288 | 1409 | 1821
> lock_manager | 323 | 797 | 827 | 1006
> WALBufMappingLock | 124 | 124 | 146 | 206
> SubtransControlLock | 85 | 146 | 170 | 120
>
> So even buffer_content gets ahead of the WALWriteLock. I wonder whether
> this might be because of only having 128 buffers for clog pages, causing
> contention on this system (surely, systems with 144 cores were not that
> common when the 128 limit was introduced).
>

Not sure, but I have checked that if we increase the number of clog
buffers beyond 128, it causes a dip in performance on read-write
workloads in some cases. Apart from that, from the above results it is
quite clear that the patches help in significantly reducing
CLogControlLock contention, with the group-update patch consistently
performing best, probably because this workload contends more on
writing the transaction status.
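
For reference, the 128-page cap Tomas mentions comes from the CLOG
buffer sizing logic in clog.c, which is roughly the following
(sketched from memory, comment wording mine):

Size
CLOGShmemBuffers(void)
{
    /*
     * Scale the number of CLOG buffers with shared_buffers, but stay
     * between 4 and 128 pages; in testing, going beyond 128 did not
     * help and sometimes regressed read-write workloads.
     */
    return Min(128, Max(4, NBuffers / 512));
}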

> So the patch has positive impact even with WAL, as illustrated by tps
> improvements (for large client counts):
>
> clients | master | gran_locking | no_content_lock | group_update
> ---------+--------+--------------+-----------------+--------------
> 36 | 39725 | 39627 | 41203 | 39763
> 72 | 70533 | 65795 | 65602 | 66195
> 108 | 81664 | 87415 | 86896 | 87199
> 144 | 68950 | 98054 | 98266 | 102834
> 180 | 105741 | 109827 | 109201 | 113911
> 216 | 62789 | 92193 | 90586 | 98995
> 252 | 94243 | 102368 | 100663 | 107515
> 288 | 57895 | 83608 | 82556 | 91738
>
> I find the tps fluctuation intriguing, and I'd like to see that fixed
> before committing any of the patches.
>

I have checked the wait event results for the cases where there is more
fluctuation:

           test           | clients | wait_event_type |   wait_event    | master | granular_locking | no_content_lock | group_update
--------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 dilip-300-unlogged-sync  |     108 | LWLockNamed     | CLogControlLock | 343526 |           502127 |          479937 |       301381
 dilip-300-unlogged-sync  |     180 | LWLockNamed     | CLogControlLock | 557639 |           835567 |          795403 |       512707

So, if I read the above results correctly, they show that group-update
has only helped slightly to reduce the contention here. One probable
reason could be that on such a workload we need to update the clog
status on different clog pages more frequently, and may also need to
perform disk reads for clog pages, so the benefit of grouping will
certainly be smaller: the page read requests get serialized, and the
leader backend has to perform all of them on behalf of the group.
Robert pointed out a somewhat similar case upthread [1], and I had
modified the patch to use multiple slots (groups) for group
transaction status update [2], but we didn't pursue it because it
didn't show any benefit on the pgbench workload. However, maybe here
it can show some benefit. If we can make the above results
reproducible and you think the above theory sounds reasonable, then I
can again modify the patch based on that idea.
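
To make the group-update idea more concrete, here is a toy,
standalone sketch of the leader/follower pattern it relies on (plain
C11 atomics and pthreads; all names are made up for illustration,
this is not the patch code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NWORKERS 8

typedef struct Waiter
{
    int         xid;        /* pretend transaction id */
    int         status;     /* status this backend wants recorded */
    struct Waiter *next;    /* next member of the pending group */
    _Atomic int done;       /* set by the leader once applied */
} Waiter;

static _Atomic(Waiter *) group_head;    /* pending group members */
static pthread_mutex_t clog_lock = PTHREAD_MUTEX_INITIALIZER;
static int  fake_clog[NWORKERS];        /* stand-in for the clog page */

static void *
worker(void *arg)
{
    Waiter      me = {.xid = (int) (long) arg, .status = 1};
    Waiter     *head = atomic_load(&group_head);

    /* Join the group with a CAS loop instead of taking the lock. */
    do
    {
        me.next = head;
    } while (!atomic_compare_exchange_weak(&group_head, &head, &me));

    if (head == NULL)
    {
        /* Leader: take the lock once on behalf of the whole group. */
        pthread_mutex_lock(&clog_lock);

        /* Detach the accumulated group and apply each member's status. */
        Waiter     *w = atomic_exchange(&group_head, (Waiter *) NULL);

        while (w != NULL)
        {
            Waiter     *next = w->next; /* read before waking the member */

            fake_clog[w->xid] = w->status;
            atomic_store(&w->done, 1);
            w = next;
        }
        pthread_mutex_unlock(&clog_lock);
    }
    else
    {
        /* Follower: the real patch sleeps on a semaphore, we just spin. */
        while (!atomic_load(&me.done))
            ;
    }
    return NULL;
}

int
main(void)
{
    pthread_t   tid[NWORKERS];

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NWORKERS; i++)
        printf("xid %d -> status %d\n", i, fake_clog[i]);
    return 0;
}

The actual patch is more involved, of course, but the basic effect is
that only one backend per group pays for the exclusive lock; the
downside, as noted above, is that any clog page reads also get
funnelled through that one leader.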

Now, the story with the granular_locking and no_content_lock patches
seems worse, because they appear to increase the contention on
CLogControlLock rather than reduce it. One probable reason this could
happen with both approaches is that a backend frequently needs to
release the CLogControlLock it acquired in Shared mode and reacquire
it in Exclusive mode because the clog page to modify is not in a
buffer (the update is for a different clog page than the one currently
buffered), and then it has to release CLogControlLock once again to
read the clog page from disk and acquire it in Exclusive mode yet
again. This frequent release and reacquire of CLogControlLock in
different modes could lead to a significant increase in contention.
It is slightly worse for the granular_locking patch, as it needs one
additional lock (buffer_content_lock) in Exclusive mode after
acquiring CLogControlLock. Offhand, I could not see a way to reduce
the contention with the granular_locking and no_content_lock patches.
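
To spell out the pattern I am describing, the control flow on a clog
page miss looks roughly like the sketch below (a pthread rwlock
standing in for CLogControlLock; this is a simplified stand-in for
illustration, not the actual slru.c code):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t clog_control_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Pretend buffer lookup: every fourth page access "misses". */
static bool
page_is_resident(int pageno)
{
    return (pageno % 4) != 0;
}

static void
set_status_on_page(int pageno)
{
    /* 1. Optimistic path: shared lock, hoping the page is resident. */
    pthread_rwlock_rdlock(&clog_control_lock);
    bool        resident = page_is_resident(pageno);

    pthread_rwlock_unlock(&clog_control_lock);

    if (!resident)
    {
        /* 2. Miss: reacquire exclusively to claim a victim buffer. */
        pthread_rwlock_wrlock(&clog_control_lock);
        /* ... choose and reserve a buffer slot for the page ... */
        pthread_rwlock_unlock(&clog_control_lock);

        /* 3. Do the disk read without holding the control lock. */
        usleep(100);            /* stand-in for the physical read */

        /* 4. Reacquire exclusively to install the page. */
        pthread_rwlock_wrlock(&clog_control_lock);
        pthread_rwlock_unlock(&clog_control_lock);
    }

    /*
     * 5. Finally update the status bits; the granular_locking patch
     * additionally takes a per-buffer content lock here.
     */
    pthread_rwlock_rdlock(&clog_control_lock);
    /* ... set the two status bits for the xid ... */
    pthread_rwlock_unlock(&clog_control_lock);
}

int
main(void)
{
    for (int pageno = 0; pageno < 8; pageno++)
        set_status_on_page(pageno);
    puts("done");
    return 0;
}

So each page miss turns one status update into three or four separate
acquisitions of the control lock, which is where I suspect the extra
contention is coming from.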

So, the crux is that we are seeing more variability in some of the
results because of frequent accesses to different clog pages, which is
not so easy to predict, but I think it is quite plausible at ~100,000
tps (a clog page covers the status of roughly 32K transactions, so at
that rate we move to a new clog page a few times per second).

>
> There's certainly much more interesting stuff in the results, but I
> don't have time for more thorough analysis now - I only intended to do
> some "quick benchmarking" on the patch, and I've already spent days on
> this, and I have other things to do.
>

Thanks a ton for doing such detailed testing.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoahCx6XgprR%3Dp5%3D%3DcF0g9uhSHsJxVdWdUEHN9H2Mv0gkw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BSoW3FBrdZV%2B3m34uCByK3DMPy_9QQs34yvN8spByzyA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
