From: | Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Fix performance of generic atomics |
Date: | 2017-05-25 13:39:22 |
Message-ID: | 9fccff0670a2ec3c031d459564892f42@postgrespro.ru |
Lists: | pgsql-hackers |
A slightly cleaner version of the patch.
Sokolov Yura wrote 2017-05-25 15:22:
> Good day, everyone.
>
> I've been playing with pgbench on a huge machine.
> (72 cores, 56 for PostgreSQL, enough memory to fit the database
> both in shared_buffers and in the file cache)
> (pgbench scale 500, unlogged tables, fsync=off,
> synchronous_commit=off, wal_writer_flush_after=0).
>
> With 200 clients, performance is around 76000 tps, and the main
> bottleneck in this dumb test is LWLockWaitListLock.
>
> I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
> (i.e. using __sync_fetch_and_or), and performance became 83000 tps.
>
> That looked a bit strange at first, because __sync_fetch_and_or
> compiles to almost the same CAS loop.
>
> Looking closely, I noticed that the intrinsic does not do the
> read in the loop body, but at loop initialization. That is correct
> behavior, because the `lock cmpxchg` instruction stores the old value
> in the EAX register when the comparison fails.
>
> That is expected behavior, and pg_atomic_compare_exchange_*_impl does
> the same in all implementations. So there is no need to re-read
> the value in the loop body:
>
> Example diff for pg_atomic_exchange_u32_impl:
>
> static inline uint32
> pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> {
>     uint32 old;
> +   old = pg_atomic_read_u32_impl(ptr);
>     while (true)
>     {
> -       old = pg_atomic_read_u32_impl(ptr);
>         if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
>             break;
>     }
>     return old;
> }
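
The same pattern can be sketched outside the PostgreSQL tree with C11 atomics. This is only an illustration, not code from the patch: the name sketch_exchange_u32 is hypothetical, and it assumes a compiler with <stdatomic.h>. It relies on atomic_compare_exchange_weak writing the value it actually found back into its second argument on failure, so one read before the loop is enough.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a generic atomic exchange built on
 * compare-and-swap. Not PostgreSQL code; C11 atomics are assumed.
 */
static inline uint32_t
sketch_exchange_u32(_Atomic uint32_t *ptr, uint32_t xchg)
{
    /* Read the current value once, before entering the loop. */
    uint32_t old = atomic_load(ptr);

    /*
     * On failure, atomic_compare_exchange_weak stores the value it
     * found in *ptr into old, so the loop body needs no extra read.
     */
    while (!atomic_compare_exchange_weak(ptr, &old, xchg))
        ;

    /* On success, old still holds the value that was replaced. */
    return old;
}
```

Calling sketch_exchange_u32(&v, 9) on a variable holding 5 should return 5 and leave 9 in v.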
>
> After applying this change to all generic atomic functions
> (including pg_atomic_fetch_or_u32_impl), performance became
> equal to that of the __sync_fetch_and_or intrinsic.
>
> The attached patch contains this change for all generic atomic
> functions, and also adds __sync_fetch_and_(or|and) for gcc, because
> I believe GCC optimizes code around intrinsics better than
> around inline assembler.
> (Final performance is around 86000 tps, but the difference between
> 83000 tps and 86000 tps is not so obvious on a NUMA system.)
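
For readers who have not used the gcc builtins mentioned above, a minimal sketch of such wrappers might look like the following. The function names are hypothetical and the code is not taken from the attached patch; it assumes GCC or a compatible compiler such as Clang, and uses a plain volatile uint32_t in place of pg_atomic_uint32.

```c
#include <stdint.h>

/*
 * Hypothetical wrappers around the legacy GCC __sync builtins.
 * Each call atomically updates *ptr and returns the value that
 * *ptr held before the update.
 */
static inline uint32_t
sketch_fetch_or_u32(volatile uint32_t *ptr, uint32_t or_)
{
    return __sync_fetch_and_or(ptr, or_);
}

static inline uint32_t
sketch_fetch_and_u32(volatile uint32_t *ptr, uint32_t and_)
{
    return __sync_fetch_and_and(ptr, and_);
}
```

Whether the compiler emits a CAS loop or a single locked instruction for these calls depends on the target and on whether the return value is used.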
>
> With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-performance-of-Atomics-generic-implementation.patch | text/x-diff | 5.7 KB |