From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Auto-vectorization speeds up multiplication of large-precision numerics
Date: 2020-06-09 11:50:25
Message-ID: CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
Lists: pgsql-hackers
There is this for loop in mul_var():

    /*
     * Add the appropriate multiple of var2 into the accumulator.
     *
     * As above, digits of var2 can be ignored if they don't contribute,
     * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
     */
    for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
         i2 >= 0; i2--)
        dig[i--] += var1digit * var2digits[i2];
With gcc -O3, the above for loop, if simplified, gets auto-vectorized
[1], and this speeds up multiplication of PostgreSQL numeric values
having large precisions. The speedup becomes noticeable from around
precision 50 onwards: with precision 50 the improvement I saw was 5%,
with 60 it was 11%, with 120 it was 50%, with 240 it was 2.2x, and so
on. On my arm64 machine, a similar benefit starts showing up from
precision 20 onwards. I used this query from
regress/sql/numeric_big.sql:
SELECT t1.val * t2.val FROM num_data t1, num_data t2
If I use the schema created by numeric_big.sql, the speedup was 2.5x
to 2.7x across three machines. Also, the regress/sql/numeric_big test
itself speeds up by 80%.
For the for loop to be auto-vectorized, I had to simplify it to
something like this:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + 2];
    for (i = 0; i <= i2; i++)
        digptr[i] += var1digit * var2digits[i];
gcc can also vectorize a backward loop such as this:

    for (i = n - 1; i >= 0; i--)
        a += b[i];

gcc -fopt-info-all gives this info:

    numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors
But if the assignment is not as simple as above, it does not vectorize
the backward traversal:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + i2 + 2];
    for (; i2 >= 0; i2--)
        digptr[i2] += var1digit * var2digits[i2];

    numeric.c:7380:3: missed: couldn't vectorize loop
    numeric.c:7381:15: missed: not vectorized: relevant stmt not
    supported: _39 = *_38;
Even with forward traversal, the loop below can't be vectorized,
seemingly because it involves two index variables:

    count = Min(var2ndigits - 1, res_ndigits - i1 - 3) + 1;
    i = i1 + i2 - count + 3;
    for (i2 = 0; i2 < count; i++, i2++)
        dig[i] += var1digit * var2digits[i2];

    numeric.c:7394:3: missed: couldn't vectorize loop
    numeric.c:7395:11: missed: not vectorized: not suitable for gather
    load _37 = *_36;
So it's better to keep the loop simple:

    i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    digptr = &dig[i1 + 2];
    for (i = 0; i <= i2; i++)
        digptr[i] += var1digit * var2digits[i];

    numeric.c:7387:3: optimized: loop vectorized using 16 byte vectors
Attached is the patch that uses the above loop.
With the patch, in the mul_var() assembly code, I could see the
multiply-accumulate instructions that operate on SIMD vectors (these
are arm64 instructions):

    smlal  v1.4s, v2.4h, v3.4h
    smlal2 v0.4s, v2.8h, v3.8h
I extracted the "SELECT t1.val * t2.val FROM num_data t1, num_data t2"
query from regress/sql/numeric_big.sql and ran it on the data that the
test creates (it inserts values with precisions ranging from 500 to
700). Attached is create_schema.sql, which creates the regression test
schema.

With this query, below are the changes in the mul_var() profile
figures with and without the patch (all figures below are from a -O3
build).
HEAD :
+ 64.06% postgres postgres [.] mul_var
+ 13.00% postgres postgres [.] get_str_from_var
+ 6.32% postgres [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 1.65% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.10% postgres [kernel.kallsyms] [k] _raw_spin_lock
+ 0.96% postgres [kernel.kallsyms] [k] get_page_from_freelist
+ 0.73% postgres [kernel.kallsyms] [k] page_counter_try_charge
+ 0.64% postgres postgres [.] AllocSetAlloc
Patched :
+ 35.91% postgres postgres [.] mul_var
+ 20.43% postgres postgres [.] get_str_from_var
+ 13.01% postgres [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 2.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.48% postgres [kernel.kallsyms] [k] _raw_spin_lock
+ 1.15% postgres [kernel.kallsyms] [k] get_page_from_freelist
+ 0.99% postgres postgres [.] AllocSetAlloc
+ 0.58% postgres postgres [.] base_yyparse
Times in milliseconds for SELECT t1.val * t2.val FROM num_data t1,
num_data t2:
Machine 1 (amd64)
Head : .668 .723 .658 .660
Patched : .288 .280 .282 .282
Machine 2 (arm64)
Head : .897 .879 .888 .897
Patched : .329 .324 .321 .320
Average times in milliseconds for numeric_big regression test :
Machine 1 (amd64)
Head : 801
Patched : 445
Machine 2 (arm64)
Head : 1105
Patched : 550
gcc -O3 option:
I understand we have kept the default gcc CFLAGS at -O2 because, I
believe, -O3 might expose bugs caused by assumptions in the code. But
with this patch, we allow products built with the -O3 flag to get this
benefit.
The actual gcc option that enables auto-vectorization is
-ftree-loop-vectorize, which is always enabled at -O3. What we could
do in the future is keep such optimized code, proven to work with
these optimization flags, in a separate file, and enable the required
compiler flags only for that file when the build is done with -O2.
[1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
--
Thanks,
-Amit Khandekar
Huawei Technologies
Attachment | Content-Type | Size
---|---|---
create_schema.sql | application/sql | 3.6 KB
0001-Auto-vectorize-loop-to-speedup-large-precision-numer.patch | text/x-patch | 2.2 KB