aarch64 build uses very slow assembly, it is fixable

From: Daniel Farina <daniel(at)fdr(dot)io>
To: pgsql-pkg-yum(at)postgresql(dot)org
Subject: aarch64 build uses very slow assembly, it is fixable
Date: 2020-10-18 06:46:49
Message-ID: CACN56+P1astF5zvocrT7--Mu2dQWFS0eQ31xNmX=b=98y9fMSw@mail.gmail.com
Lists: pgsql-pkg-yum

So, I was microbenchmarking on AWS Graviton 2 a.k.a. Neoverse N1-ish
processors (on instances c6g, m6g) and noticed that TPS was sensitive
to the number of clients and dropping to low throughputs, particularly
around 32 clients:

m6g.16xlarge, scale factor 1, select-only, 386.7K TPS, pgbench -S -j 4
--time=60 --client=34
m6g.16xlarge, scale factor 1, select-only, 596.3K TPS, pgbench -S -j 4
--time=60 --client=33
m6g.16xlarge, scale factor 1, select-only, 670.5K TPS, pgbench -S -j 4
--time=60 --client=32
m6g.16xlarge, scale factor 1, select-only, 641.4K TPS, pgbench -S -j 4
--time=60 --client=30

If you increase clients further, this can drop to 145K TPS or worse,
and the bulk of the time is spent in LWLockAcquire.

This email https://www.postgresql.org/message-id/099F69EE-51D3-4214-934A-1F28C0A1A7A7@amazon.com
reports some weaknesses in the instructions generated for aarch64, but
it does not report an improvement of this magnitude...and yet there is
one: I can get 591K TPS even with 100 clients, instead of 145K, once
"casal" is used (which can be arranged by a number of means). It
doesn't improve the best case by much, but throughput degrades far
more gracefully as clients are added.

In the profiler, the difference between using "casal" and
"ldaxr"/"stlxr" is whether postgres spends a majority of its time in
snapshot acquisition or barely any, and how steeply throughput
degrades as connections are added. Once the atomics are out of the
way, the much harder-to-avoid memory loads and stores in the planner
become the new bottleneck...a pretty big improvement.
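
For readers who want to reproduce the codegen difference outside of
postgres, here is a minimal sketch of the same builtin used in a retry
loop. The names demo_atomic_u32, demo_compare_exchange_u32 and
demo_fetch_or_u32 are made up for illustration and are not PostgreSQL's;
only the __atomic_compare_exchange_n call and its memory orders are
taken from the listings in this email. Compiling it for aarch64 and
disassembling should show either the "ldaxr"/"stlxr" loop or "casal",
depending on compiler and flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_atomic_uint32; not PostgreSQL's header. */
typedef struct
{
    volatile uint32_t value;
} demo_atomic_u32;

/* Same builtin and memory orders as in the disassembly listings. */
static inline bool
demo_compare_exchange_u32(demo_atomic_u32 *ptr, uint32_t *expected,
                          uint32_t newval)
{
    return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/*
 * A typical CAS retry loop: set the bits in mask and return the old value.
 * Under contention every failed attempt goes around the loop again, which
 * is why the cost of a single CAS matters so much.
 */
static uint32_t
demo_fetch_or_u32(demo_atomic_u32 *ptr, uint32_t mask)
{
    uint32_t old = __atomic_load_n(&ptr->value, __ATOMIC_RELAXED);

    /* On failure the builtin refreshes old with the current value. */
    while (!demo_compare_exchange_u32(ptr, &old, old | mask))
        ;
    return old;
}
```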

Okay, so the gains are very great. How do we get these instructions emitted?

One option is to compile with -march=armv8.2-a. This works on older
compilers, but will break the code for an ARM chip without the right
features. This is how I started experimenting.
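
A quick way to check whether a given set of flags enables inline LSE
atomics is the ACLE macro __ARM_FEATURE_ATOMICS. The sketch below
assumes the compiler defines that macro when LSE is enabled; recent GCC
and Clang do, but I have not checked every older release, so treat a
"no" from an old compiler with suspicion and look at the disassembly
instead. The function names are made up for illustration.

```c
#include <stdbool.h>

/*
 * True when this translation unit was compiled with LSE atomics enabled
 * at compile time (for example via -march=armv8.2-a), assuming the
 * compiler defines __ARM_FEATURE_ATOMICS in that case.
 */
static bool
built_with_inline_lse(void)
{
#if defined(__ARM_FEATURE_ATOMICS)
    return true;
#else
    return false;
#endif
}

/* Human-readable summary, handy to print from a small test program. */
static const char *
describe_atomics_build(void)
{
    return built_with_inline_lse()
        ? "inline LSE atomics (binary requires an LSE-capable CPU)"
        : "no compile-time LSE (LL/SC loops or outline atomics)";
}
```

Printing describe_atomics_build() from a small program built with and
without -march=armv8.2-a shows whether the flag took effect.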

Another is to use -moutline-atomics, as the previous mailing list post
mentions. This is available in newer GCCs than I can easily get on
CentOS 8. CentOS 8 comes with 8.3.1, and gcc-toolset-9 loads 9.2.1,
which also doesn't include it...per
https://gcc.gnu.org/gcc-9/changes.html, gcc 9.4 is required. Here's
the patch introducing outline-atomics:

commit 3950b229a5ed6710f30241c2ddc3c74909bf4740
Author: Richard Henderson <richard(dot)henderson(at)linaro(dot)org>
Date: Thu Sep 19 14:36:43 2019 +0000

aarch64: Implement -moutline-atomics

More recently, -moutline-atomics became the default:

commit cd4b68527988f42c10c0d6c10e812d299887e0c2
Author: Kyrylo Tkachov <kyrylo(dot)tkachov(at)arm(dot)com>
Date: Thu Apr 30 13:12:13 2020 +0100

[AArch64] Make -moutline-atomics on by default

Given I did not identify an easy way to obtain an rpm with any
compiler new enough to support -moutline-atomics, whether on or off by
default, I compiled a new version of GCC, 10.2, and ran ./configure
without additional flags (save symbols, for disassembly). It works, and
emits a hybrid assembly that selects between "casal" and
"ldaxr"/"stlxr" at run time.
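
The run-time selection relies on knowing whether the CPU has LSE. The
sketch below shows one way to ask Linux the same question from user
code, via getauxval(AT_HWCAP) and the HWCAP_ATOMICS bit. It is not
libgcc's implementation, just an illustration of the idea; the function
name cpu_has_lse_atomics is made up, and on anything other than aarch64
Linux it simply reports false.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
/* Fallback in case the libc headers are too old to define the bit. */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)
#endif
#endif

/*
 * Report whether the kernel advertises the ARMv8.1 LSE atomics
 * (casal and friends) for the CPU we are running on.
 */
static bool
cpu_has_lse_atomics(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    /* Not aarch64 Linux: no LSE to detect. */
    return false;
#endif
}
```

On a Graviton 2 instance this should report true; on an ARMv8.0 part
without LSE it should report false, which is the case where the
"ldaxr"/"stlxr" fallback would be taken.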

I also attempted to measure any gains from using -mtune=neoverse-n1,
available in such new GCC, which not only avoids emitting the generic
atomics code, but changes various cost metrics and so on as well. This
was worth maybe 1% gain or less, and I hardly think the small
improvement is from avoiding a couple of branches around the CAS --
there's just too little time spent in that part of the program either
way.

So unfortunately, this does not leave fantastic options for generating
good code on CentOS, for lack of handy GCC versions. But I wanted to
let you know of these limitations in what's commonly available and
that the problem is solvable...and worth solving.

Included are disassemblies for reference.

With gcc 8, per normal CentOS:

Bad, very slow, no indirection, "ldaxr"/"stlxr":

Disassembly of section .text:
0000000000897084 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) &&
defined(HAVE_GCC__ATOMIC_INT32_CAS)
#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
{
0.00 sub sp, sp, #0x20
0.00 str x0, [sp, #24]
0.01 str x1, [sp, #16]
0.00 str w2, [sp, #12]
/* FIXME: we can probably use a lower consistency model */
return __atomic_compare_exchange_n(&ptr->value, expected,
newval, false,
0.00 ldr x0, [sp, #24]
0.02 ldr x1, [sp, #16]
0.01 ldr w1, [x1]
0.01 ldr w3, [sp, #12]
26.86 20: ldaxr w2, [x0]
72.04 cmp w2, w1
↓ b.ne 34
0.04 stlxr w4, w3, [x0]
0.04 ↑ cbnz w4, 20
0.89 34: cset w0, eq // eq = none
0.01 cmp w0, #0x0
0.00 ↓ b.ne 48
0.03 ldr x1, [sp, #16]
0.01 str w2, [x1]
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
0.01 48: add sp, sp, #0x20
← ret

Good, fast, no indirection, casal, -march=armv8.2-a on an older compiler:

Disassembly of section .text:
0000000000896d58 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) &&
defined(HAVE_GCC__ATOMIC_INT32_CAS)
#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
{
0.16 sub sp, sp, #0x20
0.17 str x0, [sp, #24]
0.60 str x1, [sp, #16]
str w2, [sp, #12]
/* FIXME: we can probably use a lower consistency model */
return __atomic_compare_exchange_n(&ptr->value, expected,
newval, false,
0.33 ldr x0, [sp, #24]
1.27 ldr x1, [sp, #16]
0.61 ldr w1, [x1]
1.82 ldr w3, [sp, #12]
1.98 mov w2, w1
casal w2, w3, [x0]
87.18 cmp w2, w1
cset w0, eq // eq = none
cmp w0, #0x0
↓ b.ne 40
0.22 ldr x1, [sp, #16]
0.33 str w2, [x1]
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
5.23 40: add sp, sp, #0x20
0.11 ← ret

Okay, moving on to the gcc 10.2 disassembly.

This is what outline-atomics looks like:

Disassembly of section .text:

00000000008ab5a4 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) &&
defined(HAVE_GCC__ATOMIC_INT32_CAS)
#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
{
0.73 stp x29, x30, [sp, #-64]!
5.49 mov x29, sp
str x19, [sp, #16]
1.81 str x0, [sp, #56]
5.49 str x1, [sp, #48]
1.10 str w2, [sp, #44]
/* FIXME: we can probably use a lower consistency model */
return __atomic_compare_exchange_n(&ptr->value, expected,
newval, false,
ldr x1, [sp, #56]
6.58 ldr x0, [sp, #48]
3.29 ldr w19, [x0]
15.00 ldr w0, [sp, #44]
11.33 mov x2, x1
mov w1, w0
mov w0, w19
→ bl __aarch64_cas4_acq_rel
cmp w0, w19
mov w2, w0
cset w0, eq // eq = none
cmp w0, #0x0
↓ b.ne 54
2.56 ldr x1, [sp, #48]
3.29 str w2, [x1]
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
33.86 54: ldr x19, [sp, #16]
4.00 ldp x29, x30, [sp], #64
5.49 ← ret

and inside __aarch64_cas4_acq_rel:

Disassembly of section .text:

0000000000ab2d30 <__aarch64_cas4_acq_rel>:
__aarch64_cas4_acq_rel():
cbz w(tmp0), \label
.endm

#ifdef L_cas

STARTFN NAME(cas)
0.46 hint #0x22
JUMP_IF_NOT_LSE 8f
0.21 adrp x16, hist_entries+0x1f750
ldrb w16, [x16, #2260]
2.30 ↓ cbz w16, 18
# define CAS glue4(cas, A, L, S) s(0), s(1), [x2]
#else
# define CAS .inst 0x08a07c41 + B + M
#endif

CAS /* s(0), s(1), [x2] */
0.75 casal w0, w1, [x2]
ret
96.27 ← ret

8: UXT s(tmp0), s(0)
18: mov w16, w0
0: LDXR s(0), [x2]
1c: ldaxr w0, [x2]
cmp s(0), s(tmp0)
cmp w0, w16
bne 1f
↓ b.ne 30
STXR w(tmp1), s(1), [x2]
stlxr w17, w1, [x2]
cbnz w(tmp1), 0b
↑ cbnz w17, 1c
1: ret
30: ← ret

You can see the bad ldaxr code at the bottom, never executed.
