From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Fix performance degradation of contended LWLock on NUMA
Date: 2017-09-11 15:01:14
Message-ID: 8e1dce77-6d4c-cfdb-00e0-8e9b3321aaba@redhat.com
Lists: pgsql-hackers
Hi,
On 09/08/2017 03:35 PM, Sokolov Yura wrote:
>> I'm seeing
>>
>> -M prepared: Up to 11% improvement
>> -M prepared -S: No improvement, no regression ("noise")
>> -M prepared -N: Up to 12% improvement
>>
>> For all runs, the improvement shows up the closer you get to the number
>> of CPU threads, or above it. Although I'm not seeing the same
>> improvements as you at very large client counts, there are definitely
>> improvements :)
>
> It is expected:
> - the patch "fixes NUMA": for example, it doesn't give any improvement on 1 socket
>   (I've tested it using numactl to bind to 1 socket),
> - and it certainly gives less improvement on 2 sockets than on 4 sockets
>   (28 cores vs 72 cores also makes a difference),
> - one of the hot points was CLogControlLock, and it was fixed with
>   "Group mode for CLOG updates" [1]
>
I'm planning to re-test that patch.
>>
>> +static inline bool
>> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode,
>> +                             LWLockMode waitmode)
>>
>> I'll leave it to the Committer to decide if this method is too big to
>> be "inline".
>
> GCC 4.9 doesn't inline it without the directive, and the function call
> then shows up noticeably in the profile.
>
> The attachment contains a version with all suggestions applied, except
> for the removal of "inline".
>
Yes, ideally the function should be kept "inline".
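Just to illustrate what "inline" buys here, below is a minimal standalone
sketch, not the patch itself: GCC's always_inline attribute overrides the
size heuristics that make GCC 4.9 keep the function out of line, so the
call no longer shows up as a separate frame in the profile. The function
name, signature and body are made up for illustration.

/*
 * Illustrative sketch only; not the actual LWLockAttemptLockOrQueueSelf().
 * __attribute__((always_inline)) forces GCC to inline the function despite
 * its size heuristics.
 */
#include <stdatomic.h>
#include <stdbool.h>

static inline bool __attribute__((always_inline))
attempt_lock_or_queue_self(atomic_uint *state, unsigned int exclusive_bit)
{
    unsigned int old = atomic_load_explicit(state, memory_order_relaxed);

    /* Try to set the exclusive bit while it is still clear. */
    while ((old & exclusive_bit) == 0)
    {
        if (atomic_compare_exchange_weak(state, &old, old | exclusive_bit))
            return true;        /* lock acquired */
    }
    return false;               /* the real code would queue itself here */
}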
>> Open questions:
>> ---------------
>> * spins_per_delay as extern
>> * Calculation of skip_wait_list
>
> Currently the calculation of skip_wait_list is mostly empirical (i.e. the
> best value I measured).
>
Ok, good to know.
> I strongly think that, instead of spins_per_delay, something dependent
> on the concrete lock should be used. I tried to store it in the LWLock
> itself, but that was worse.
Yes, LWLock should be kept as small as possible and cache-line aligned,
because of the cache storms shown by perf c2c.
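To make the alignment point concrete, here is a sketch of the padding
idiom I have in mind (PostgreSQL already wraps LWLock in a padded union,
LWLockPadded, for the main lock arrays); the struct members and the
128-byte line size below are assumptions for illustration only.

/*
 * Padding sketch: one lock per cache line, so contention on one lock does
 * not cause false sharing with its neighbours.  The containing array must
 * itself be allocated cache-line aligned.
 */
#define MY_CACHE_LINE_SIZE 128

typedef struct MyLock
{
    unsigned int state;         /* lock word */
    void       *waiters_head;   /* head of the wait queue */
    void       *waiters_tail;   /* tail of the wait queue */
} MyLock;

typedef union MyLockPadded
{
    MyLock      lock;
    char        pad[MY_CACHE_LINE_SIZE];
} MyLockPadded;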
> Recently I realized it should be stored in an array indexed by tranche,
> but I haven't implemented that yet, and therefore haven't measured it.
>
Different constants for each LWLock could have an impact, but the
constants would also depend on the machine setup and the workload.
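For what it's worth, here is a hedged sketch of the per-tranche idea, not
code from the patch: keep an adaptive spin estimate per LWLock tranche
rather than one global spins_per_delay or a counter in every LWLock. The
array size, default value and averaging rule are assumptions, and a real
implementation would need per-backend or shared-memory state.

/*
 * Per-tranche spin estimate sketch (illustrative only, and not
 * concurrency-safe as written).
 */
#define MAX_TRANCHE_IDS        256
#define DEFAULT_SPIN_ESTIMATE  100

static int tranche_spin_estimate[MAX_TRANCHE_IDS];

static inline int
get_tranche_spin_estimate(int tranche_id)
{
    int v = tranche_spin_estimate[tranche_id];

    return (v > 0) ? v : DEFAULT_SPIN_ESTIMATE;
}

static inline void
update_tranche_spin_estimate(int tranche_id, int spins_observed)
{
    /* Exponential moving average, as one possible adaptation rule. */
    int cur = get_tranche_spin_estimate(tranche_id);

    tranche_spin_estimate[tranche_id] = (cur * 7 + spins_observed) / 8;
}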
Thanks for working on this!
Best regards,
Jesper