Re: Optimize mul_var() for var1ndigits >= 8

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Date: 2024-07-28 19:18:07
Message-ID: CAEZATCViHMfZdnfxpVJzTf08FRkuyvBOvEG=RsV9TekUgOExGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 12 Jul 2024 at 13:34, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06 -- HEAD
> call rate=3.883459e+06 -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06 -- HEAD
> call rate=3.7965685e+06 -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06 -- HEAD
> call rate=3.39948e+06 -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs
>

I spent some more time hacking on this, to try to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation, which is very
slow -- every compiler/platform I looked at on godbolt.org turns this
into a call to a slow function like __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_small() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might be be different (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice though, this
should be extremely unlikely, due to the fact that any difference
would have to propagate through MUL_GUARD_DIGITS NBASE digits (8
decimal digits), and it doesn't show up in any of the existing tests.
IMO a very small chance of different results on different platforms is
probably acceptable, but it wouldn't be acceptable to make the
threshold a runtime configurable parameter that could lead to
different results on the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build
============

Balanced inputs:

ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+---------------+---------------+----------
1 | 1 | 5.973068e+06 | 6.873059e+06 | +15.07%
2 | 2 | 5.646598e+06 | 6.6456665e+06 | +17.69%
3 | 3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
4 | 4 | 5.545298e+06 | 6.9701605e+06 | +25.69%
5 | 5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
6 | 6 | 4.9276335e+06 | 6.478698e+06 | +31.48%
7 | 7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
8 | 8 | 4.898536e+06 | 5.1599975e+06 | +5.34%
9 | 9 | 4.537171e+06 | 4.830987e+06 | +6.48%
10 | 10 | 4.308139e+06 | 4.6029265e+06 | +6.84%
11 | 11 | 4.092612e+06 | 4.37966e+06 | +7.01%
12 | 12 | 3.9345035e+06 | 4.213998e+06 | +7.10%
13 | 13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
14 | 14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
15 | 15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
16 | 16 | 3.657139e+06 | 3.9067975e+06 | +6.83%
17 | 17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
18 | 18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
19 | 19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
20 | 20 | 2.9012185e+06 | 3.0496352e+06 | +5.12%
21 | 21 | 2.755444e+06 | 2.9508798e+06 | +7.09%
22 | 22 | 2.6263908e+06 | 2.8168945e+06 | +7.25%
23 | 23 | 2.5470438e+06 | 2.7056318e+06 | +6.23%
24 | 24 | 2.764418e+06 | 2.9597732e+06 | +7.07%
25 | 25 | 2.509954e+06 | 2.7368215e+06 | +9.04%
26 | 26 | 2.3699395e+06 | 2.565404e+06 | +8.25%
27 | 27 | 2.286344e+06 | 2.4400948e+06 | +6.72%
28 | 28 | 2.199087e+06 | 2.361374e+06 | +7.38%
29 | 29 | 2.1208148e+06 | 2.26925e+06 | +7.00%
30 | 30 | 2.0469475e+06 | 2.2127455e+06 | +8.10%
31 | 31 | 1.9973804e+06 | 2.3771615e+06 | +19.01%
32 | 32 | 2.1288205e+06 | 2.3166555e+06 | +8.82%
33 | 33 | 1.9876898e+06 | 2.1759028e+06 | +9.47%
34 | 34 | 1.8906434e+06 | 2.169534e+06 | +14.75%
35 | 35 | 1.8069352e+06 | 1.990085e+06 | +10.14%
36 | 36 | 1.7412926e+06 | 1.9940908e+06 | +14.52%
37 | 37 | 1.6956435e+06 | 1.8492525e+06 | +9.06%
38 | 38 | 1.6253338e+06 | 1.8493976e+06 | +13.79%
39 | 39 | 1.5734566e+06 | 1.9296996e+06 | +22.64%
40 | 40 | 1.6692021e+06 | 1.902438e+06 | +13.97%
50 | 50 | 1.1116885e+06 | 1.5319529e+06 | +37.80%
100 | 100 | 399552.38 | 722142.44 | +80.74%
250 | 250 | 81092.02 | 195967.67 | +141.66%
500 | 500 | 21654.633 | 58329.473 | +169.36%
1000 | 1000 | 5525.9775 | 16420.611 | +197.15%
2500 | 2500 | 907.98206 | 2825.7324 | +211.21%
5000 | 5000 | 230.26935 | 731.26105 | +217.57%
10000 | 10000 | 57.793922 | 179.12334 | +209.93%
16383 | 16383 | 21.446728 | 64.39028 | +200.23%

Unbalanced inputs:

ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
1 | 10000 | 42816.89 | 52843.01 | +23.42%
2 | 10000 | 41032.25 | 52111.957 | +27.00%
3 | 10000 | 39550.176 | 52262.477 | +32.14%
4 | 10000 | 38015.59 | 43962.535 | +15.64%
5 | 10000 | 36560.22 | 43736.305 | +19.63%
6 | 10000 | 35305.77 | 38204.2 | +8.21%
7 | 10000 | 33833.086 | 36533.387 | +7.98%
8 | 10000 | 32847.996 | 35774.715 | +8.91%
9 | 10000 | 31345.736 | 33831.926 | +7.93%
10 | 10000 | 30351.6 | 32715.969 | +7.79%
11 | 10000 | 29321.592 | 31478.398 | +7.36%
12 | 10000 | 28616.018 | 30861.885 | +7.85%
13 | 10000 | 28216.12 | 29510.95 | +4.59%
14 | 10000 | 27396.408 | 29077.73 | +6.14%
15 | 10000 | 26519.08 | 28235.08 | +6.47%
16 | 10000 | 25778.102 | 27538.908 | +6.83%
17 | 10000 | 26024.918 | 26677.498 | +2.51%
18 | 10000 | 25316.346 | 26729.395 | +5.58%
19 | 10000 | 24626.07 | 26076.979 | +5.89%
20 | 10000 | 23912.383 | 25709.967 | +7.52%
21 | 10000 | 23238.488 | 24761.57 | +6.55%
22 | 10000 | 22746.25 | 23925.934 | +5.19%
23 | 10000 | 22120.777 | 23442.34 | +5.97%
24 | 10000 | 21645.215 | 22771.193 | +5.20%
25 | 10000 | 21135.049 | 22185.893 | +4.97%
26 | 10000 | 20685.025 | 21831.74 | +5.54%
27 | 10000 | 20039.559 | 20854.793 | +4.07%
28 | 10000 | 19846.092 | 21072.758 | +6.18%
29 | 10000 | 19414.059 | 20289.414 | +4.51%
30 | 10000 | 18968.617 | 19774.797 | +4.25%
31 | 10000 | 18394.988 | 21307.074 | +15.83%
32 | 10000 | 18291.504 | 21349.691 | +16.72%
33 | 10000 | 17899.676 | 20885.299 | +16.68%
34 | 10000 | 17505.402 | 20620.105 | +17.79%
35 | 10000 | 17174.918 | 20383.594 | +18.68%
36 | 10000 | 16609.867 | 20342.18 | +22.47%
37 | 10000 | 16457.889 | 19953.91 | +21.24%
38 | 10000 | 15926.13 | 19783.203 | +24.22%
39 | 10000 | 15441.283 | 19288.654 | +24.92%
40 | 10000 | 15038.773 | 19415.52 | +29.10%
50 | 10000 | 11264.285 | 17608.648 | +56.32%
100 | 10000 | 6253.7637 | 11620.726 | +85.82%
250 | 10000 | 2696.207 | 5939.3857 | +120.29%
500 | 10000 | 1338.4141 | 3268.2004 | +144.18%
1000 | 10000 | 672.6717 | 1691.9614 | +151.53%
2500 | 10000 | 267.5996 | 708.20386 | +164.65%
5000 | 10000 | 131.50755 | 353.92822 | +169.13%

numeric_big regression test:

ok 224 - numeric_big 279 ms [PG17]
ok 224 - numeric_big 182 ms [v3 patch]

64-bit build
============

Balanced inputs:

ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+---------------+---------------+----------
1 | 1 | 8.507485e+06 | 9.53221e+06 | +12.04%
2 | 2 | 8.0975715e+06 | 9.431853e+06 | +16.48%
3 | 3 | 6.461359e+06 | 7.3669945e+06 | +14.02%
4 | 4 | 6.1728355e+06 | 7.152418e+06 | +15.87%
5 | 5 | 6.500831e+06 | 7.6977115e+06 | +18.41%
6 | 6 | 6.1784075e+06 | 7.3765005e+06 | +19.39%
7 | 7 | 5.90117e+06 | 6.2799965e+06 | +6.42%
8 | 8 | 5.9217105e+06 | 6.147141e+06 | +3.81%
9 | 9 | 5.477262e+06 | 5.9889475e+06 | +9.34%
10 | 10 | 5.2147235e+06 | 5.858963e+06 | +12.35%
11 | 11 | 4.882895e+06 | 5.6766675e+06 | +16.26%
12 | 12 | 4.61105e+06 | 5.559544e+06 | +20.57%
13 | 13 | 4.382494e+06 | 5.3770165e+06 | +22.69%
14 | 14 | 4.134509e+06 | 5.256462e+06 | +27.14%
15 | 15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
16 | 16 | 4.3353435e+06 | 4.970363e+06 | +14.65%
17 | 17 | 3.9258755e+06 | 4.935394e+06 | +25.71%
18 | 18 | 3.7562495e+06 | 4.8723875e+06 | +29.71%
19 | 19 | 3.4890312e+06 | 4.723648e+06 | +35.39%
20 | 20 | 3.289758e+06 | 4.6569305e+06 | +41.56%
21 | 21 | 3.103908e+06 | 4.4747755e+06 | +44.17%
22 | 22 | 2.9545148e+06 | 4.4227305e+06 | +49.69%
23 | 23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
24 | 24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
25 | 25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
26 | 26 | 2.8158568e+06 | 4.0437498e+06 | +43.61%
27 | 27 | 2.6376518e+06 | 3.8959785e+06 | +47.71%
28 | 28 | 2.5094318e+06 | 3.8648428e+06 | +54.01%
29 | 29 | 2.3714905e+06 | 3.67182e+06 | +54.83%
30 | 30 | 2.2456015e+06 | 3.6337285e+06 | +61.82%
31 | 31 | 2.169437e+06 | 3.7209152e+06 | +71.52%
32 | 32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
33 | 33 | 2.27133e+06 | 3.435459e+06 | +51.25%
34 | 34 | 2.1836462e+06 | 3.4042262e+06 | +55.90%
35 | 35 | 2.0753196e+06 | 3.2125422e+06 | +54.80%
36 | 36 | 1.9650498e+06 | 3.185525e+06 | +62.11%
37 | 37 | 1.8668318e+06 | 3.0366508e+06 | +62.66%
38 | 38 | 1.7678832e+06 | 3.014941e+06 | +70.54%
39 | 39 | 1.6764314e+06 | 3.1080448e+06 | +85.40%
40 | 40 | 1.9170026e+06 | 3.0942025e+06 | +61.41%
50 | 50 | 1.2242934e+06 | 2.3769868e+06 | +94.15%
100 | 100 | 401733.62 | 1.0854601e+06 | +170.19%
250 | 250 | 81861.45 | 249837.78 | +205.20%
500 | 500 | 21613.402 | 71239.04 | +229.61%
1000 | 1000 | 5551.617 | 18349.414 | +230.52%
2500 | 2500 | 906.501 | 3107.6462 | +242.82%
5000 | 5000 | 231.65045 | 794.86444 | +243.13%
10000 | 10000 | 58.372395 | 188.37387 | +222.71%
16383 | 16383 | 21.629467 | 73.58552 | +240.21%

Unbalanced inputs:

ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
1 | 10000 | 44137.258 | 62292.844 | +41.13%
2 | 10000 | 42009.895 | 62705.445 | +49.26%
3 | 10000 | 40569.617 | 58555.727 | +44.33%
4 | 10000 | 38914.03 | 49439.008 | +27.05%
5 | 10000 | 37361.39 | 47173.445 | +26.26%
6 | 10000 | 35807.61 | 42609.203 | +18.99%
7 | 10000 | 34386.684 | 49325.582 | +43.44%
8 | 10000 | 33380.312 | 49298.59 | +47.69%
9 | 10000 | 32228.17 | 46869.844 | +45.43%
10 | 10000 | 31022.46 | 47015.98 | +51.55%
11 | 10000 | 29782.623 | 45074 | +51.34%
12 | 10000 | 29540.896 | 44712.23 | +51.36%
13 | 10000 | 28521.643 | 42589.98 | +49.33%
14 | 10000 | 27772.59 | 43325.863 | +56.00%
15 | 10000 | 26871.25 | 41443 | +54.23%
16 | 10000 | 26179.322 | 41245.508 | +57.55%
17 | 10000 | 26367.402 | 39348.543 | +49.23%
18 | 10000 | 25769.176 | 40105.402 | +55.63%
19 | 10000 | 24869.504 | 38316.44 | +54.07%
20 | 10000 | 24395.436 | 37647.33 | +54.32%
21 | 10000 | 23532.748 | 36552.914 | +55.33%
22 | 10000 | 23151.842 | 36824.04 | +59.05%
23 | 10000 | 22661.33 | 35556.918 | +56.91%
24 | 10000 | 22113.502 | 34923.83 | +57.93%
25 | 10000 | 21481.773 | 33601.785 | +56.42%
26 | 10000 | 20943.576 | 34277.58 | +63.67%
27 | 10000 | 20437.605 | 32957.406 | +61.26%
28 | 10000 | 20049.12 | 32413.64 | +61.67%
29 | 10000 | 19674.787 | 31537.846 | +60.30%
30 | 10000 | 19092.572 | 32252.404 | +68.93%
31 | 10000 | 18761.932 | 30825.836 | +64.30%
32 | 10000 | 18480.184 | 30616.389 | +65.67%
33 | 10000 | 18130.89 | 29493.594 | +62.67%
34 | 10000 | 17750.996 | 30054.01 | +69.31%
35 | 10000 | 17406.83 | 29090.297 | +67.12%
36 | 10000 | 17138.23 | 29117.42 | +69.90%
37 | 10000 | 16666.799 | 28429.32 | +70.57%
38 | 10000 | 16144.025 | 29082.398 | +80.14%
39 | 10000 | 15548.838 | 28195.258 | +81.33%
40 | 10000 | 15305.571 | 27273.215 | +78.19%
50 | 10000 | 11099.766 | 25494.129 | +129.68%
100 | 10000 | 6310.7827 | 14895.447 | +136.03%
250 | 10000 | 2687.7397 | 7149.1016 | +165.99%
500 | 10000 | 1354.7455 | 3608.8845 | +166.39%
1000 | 10000 | 677.3838 | 1852.1256 | +173.42%
2500 | 10000 | 269.74582 | 748.5785 | +177.51%
5000 | 10000 | 132.6432 | 377.23288 | +184.40%

numeric_big regression test:

ok 224 - numeric_big 256 ms [PG17]
ok 224 - numeric_big 161 ms [v3 patch]

Regards,
Dean

Attachment Content-Type Size
v3-0002-Optimise-numeric-multiplication-using-base-NBASE-.patch text/x-patch 22.1 KB
v3-0001-Extend-mul_var_short-to-5-and-6-digit-inputs.patch text/x-patch 11.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-07-28 19:43:23 Re: why is pg_upgrade's regression run so slow?
Previous Message Andrey M. Borodin 2024-07-28 18:44:22 Re: UUID v7