From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Reduce ProcArrayLock contention |
Date: | 2015-07-25 04:42:03 |
Message-ID: | CAA4eK1JVwEpE8e+qz9tbF9HgFmJtj4qqDR3Vzu3VsDPP71H0QQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 24, 2015 at 4:26 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
wrote:
>
>
>
> On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
>>
>>
>>
>> pgbench setup
>> ------------------------
>> scale factor - 300
>> Data is on magnetic disk and WAL on ssd.
>> pgbench -M prepared tpc-b
>>
>> Head : commit 51d0fe5d
>> Patch -1 : group_xid_clearing_at_trans_end_rel_v1
>>
>>
>> Client Count/TPS      1      8     16     32     64    128
>> HEAD                814   6092  10899  19926  23636  17812
>> Patch-1            1086   6483  11093  19908  31220  28237
>>
>> The graph for the data is attached.
>>
>
> Numbers look impressive and definitely show that the idea is worth
> pursuing. I tried the patch on my laptop. Unfortunately, at least for 4
> and 8 clients, I did not see any improvement.
>
I can't help much there, because I think we need a somewhat
bigger machine to test the impact of the patch.
> In fact, averages over 2 runs showed a slight 2-4% decline in the tps.
> Having said that, there is no reason to disbelieve your numbers, and on
> much more powerful machines we might see the gains.
>
> BTW I ran the tests with, pgbench -s 10 -c 4 -T 300
>
I am not sure this result is worth investigating, as in
write tests (especially of short duration) such fluctuations can
occur, and I think until we see complete results for multiple client
counts (1, 4, 8 .. 64 or 128), possibly on some high-end machine,
it is difficult to draw any conclusion.
>
>> Points about performance data
>> ---------------------------------------------
>> 1. Gives a good performance improvement at or above 64 clients,
>> and a somewhat moderate improvement at lower client counts. The
>> reason is that the contention around ProcArrayLock is mainly
>> seen at higher client counts. I have checked that at higher client
>> counts it starts behaving as if lockless (which means performance with
>> the patch is equivalent to just commenting out ProcArrayLock in
>> ProcArrayEndTransaction()).
>
>
> Well, I am not entirely sure if that's a correct way of looking at it.
> Sure, you would see less contention on the ProcArrayLock, because the
> fact is that there are far fewer backends trying to acquire it.
>
I was stating that fact even for the case without my patch. Basically,
I have tried commenting out ProcArrayLock in ProcArrayEndTransaction().
> But those who don't get the lock will sleep, and hence the contention is
> moved somewhere else, at least partially.
>
Sure, if contention is reduced at one place, it will move
to the next lock.
>>
>> 4. The gains are visible when the data fits in shared_buffers, as for
>> other workloads I/O starts dominating.
>
>
> That seems perfectly expected.
>
>>
>> 5. I have seen that the effect of the patch is much more visible if we
>> keep autovacuum = off (doing a manual vacuum after each run) and keep
>> wal_writer_delay at a lower value (say 20ms).
>
>
> Do you know why that happens? Is it because the contention moves
> somewhere else with autovacuum on?
>
No; autovacuum generates I/O, due to which there is
sometimes more variation in write tests.
> Regarding the design itself, I have an idea that maybe we can create a
> general-purpose infrastructure to use this technique.
>
I think this could be beneficial if we can come up with
some clean interface.
> If it's useful here, I'm sure there are other places where this can be
> applied with similar effect.
>
I also think so.
> For example, how about adding an API such as LWLockDispatchWork(lock,
> mode, function_ptr, data_ptr)? Here the data_ptr points to somewhere in
> shared memory that the function_ptr can work on once the lock is
> available. If the lock is available in the requested mode, then the
> function_ptr is executed with the given data_ptr and the function
> returns.
>
I can do something like that if others also agree with this new
API in the LWLock series, but personally I don't think lwlock.c is
the right place to expose an API for this work. Broadly, the work
we are doing can be thought of as the below sub-tasks:
1. Advertise each backend's xid.
2. Push all backends except one onto a global list.
3. Wait till someone wakes us, and check if the xid is cleared;
repeat until the xid is clear.
4. Acquire the lock.
5. Pop all the backends, clear each one's xid, and use
their published xids to advance the global latestCompletedXid.
6. Release the lock.
7. Wake all the processes waiting for their xids to be cleared,
and before waking each one, mark that its xid is clear.
Among these, only step 2 can be common among different
algorithms; the others need some work specific to each optimization.
Does anyone else see a better way to provide a generic API, so
that it can be used in other places if required in the future?
> If the lock is not available, then the work is dispatched to some Q
> (tracked on a per-lock basis?) and the process goes to sleep. Whenever
> the lock becomes available in the requested mode, the work is executed by
> some other backend and the primary process is woken up. This will most
> likely happen in the LWLockRelease() path, when the last holder is about
> to give up the lock so that it becomes available in the requested "mode".
>
I am not able to follow what you want to achieve with this.
Why is 'Q' better than the current process for performing the
work specific to the whole group, and does 'Q' also wait on the
current lock? If yes, how?
I think this will overcomplicate the stuff without any real
benefit, at least for this optimization.
>
> Regarding the patch, the compare-and-exchange function calls that you've
> used would work only on 64-bit machines, right? You would need to use
> equivalent 32-bit calls on a 32-bit machine.
>
I thought that the internal API would automatically take care of it;
for example, for MSVC it uses _InterlockedCompareExchange64.
If that doesn't work on 32-bit systems or is not defined, then
we would have to use the 32-bit version, but I am not certain about
that.
Note - This patch requires some updates to
src/backend/access/transam/README.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com