Re: Fix performance degradation of contended LWLock on NUMA

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Fix performance degradation of contended LWLock on NUMA
Date: 2017-09-11 15:01:14
Message-ID: 8e1dce77-6d4c-cfdb-00e0-8e9b3321aaba@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 09/08/2017 03:35 PM, Sokolov Yura wrote:
>> I'm seeing
>>
>> -M prepared: Up to 11% improvement
>> -M prepared -S: No improvement, no regression ("noise")
>> -M prepared -N: Up to 12% improvement
>>
>> for all runs the improvement shows up the closer you get to the number
>> of CPU threads, or above. Although I'm not seeing the same
>> improvements as you on very large client counts there are definitely
>> improvements :)
>
> It is expected:
> - patch "fixes NUMA": for example, it doesn't give improvement on 1 socket
>   at all (I've tested it using numactl to bind to 1 socket)
> - and certainly it gives less improvement on 2 sockets than on 4 sockets
>   (and 28 cores vs 72 cores also gives difference),
> - one of hot points were CLogControlLock, and it were fixed with
>   "Group mode for CLOG updates" [1]
>

I'm planning to re-test that patch.

>>
>> +static inline bool
>> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode,
>> LWLockMode waitmode)
>>
>> I'll leave it to the Committer to decide if this method is too big to
>> be "inline".
>
> GCC 4.9 doesn't want to inline it without directive, and function call
> is then remarkable in profile.
>
> Attach contains version with all suggestions applied except remove of
> "inline".
>

Yes, ideally the method will be kept at "inline".

>> Open questions:
>> ---------------
>> * spins_per_delay as extern
>> * Calculation of skip_wait_list
>
> Currently calculation of skip_wait_list is mostly empirical (ie best
> i measured).
>

Ok, good to know.

> I strongly think that instead of spins_per_delay something dependent
> on concrete lock should be used. I tried to store it in a LWLock
> itself, but it were worse.

Yes, LWLock should be kept as small as possible, and cache line aligned
due to the cache storms, as shown by perf c2c.

> Recently I understand it should be stored in array indexed by tranche,
> but I didn't implement it yet, and therefore didn't measure.
>

Different constants for the LWLock could have an impact, but the
constants would also be dependent on machine setup, and work load.

Thanks for working on this !

Best regards,
Jesper

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-09-11 15:17:09 Re: The case for removing replacement selection sort
Previous Message Tom Lane 2017-09-11 14:55:12 Re: Still another race condition in recovery TAP tests