Re: Optimising numeric division

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimising numeric division
Date: 2024-08-23 22:00:30
Message-ID: d3056f41-fb73-4bca-9947-03f9b13be1d7@app.fastmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2024, at 21:21, Joel Jacobson wrote:
> Attachments:
> * perf_test-M3 Max.out
> * perf_test-Intel Core i9-14900K.out
> * perf_test-AMD Ryzen 9 7950X3D.out

Here are some additional benchmarks from pg-catbench:

AMD Ryzen 9 7950X3D:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+----------+----------+--------
1 | 1 | 43 ns | 39 ns | 11 ns | -3.3 ns | -8 | 0
1 | 2 | 47 ns | 43 ns | 11 ns | -3.2 ns | -7 | 0
1 | 3 | 55 ns | 89 ns | 18 ns | 33 ns | 60 | 2
1 | 4 | 76 ns | 93 ns | 20 ns | 16 ns | 22 | 1
1 | 8 | 190 ns | 98 ns | 36 ns | -94 ns | -49 | 3
1 | 16 | 280 ns | 120 ns | 52 ns | -160 ns | -57 | 3
1 | 32 | 490 ns | 140 ns | 98 ns | -350 ns | -72 | 4
1 | 64 | 780 ns | 190 ns | 110 ns | -590 ns | -75 | 5
1 | 128 | 1.4 µs | 330 ns | 85 ns | -1.1 µs | -77 | 13
1 | 256 | 590 ns | 210 ns | 120 ns | -380 ns | -65 | 3
1 | 512 | 1.6 µs | 430 ns | 460 ns | -1.2 µs | -74 | 3
1 | 1024 | 2.8 µs | 820 ns | 1 µs | -2 µs | -71 | 2
1 | 2048 | 6.6 µs | 1.4 µs | 1.9 µs | -5.2 µs | -78 | 3
1 | 4096 | 11 µs | 2.8 µs | 2.5 µs | -8.5 µs | -76 | 3
1 | 8192 | 25 µs | 5.6 µs | 7.4 µs | -20 µs | -78 | 3
1 | 16384 | 53 µs | 15 µs | 15 µs | -37 µs | -71 | 3
2 | 2 | 49 ns | 49 ns | 12 ns | 2e-10 s | 0 | 0
2 | 3 | 54 ns | 93 ns | 16 ns | 39 ns | 73 | 2
2 | 4 | 86 ns | 97 ns | 19 ns | 10 ns | 12 | 1
2 | 8 | 200 ns | 89 ns | 36 ns | -110 ns | -56 | 3
2 | 16 | 340 ns | 120 ns | 63 ns | -210 ns | -63 | 3
2 | 32 | 460 ns | 130 ns | 84 ns | -320 ns | -71 | 4
2 | 64 | 770 ns | 210 ns | 68 ns | -560 ns | -73 | 8
2 | 128 | 1.5 µs | 330 ns | 290 ns | -1.2 µs | -78 | 4
2 | 256 | 1.2 µs | 380 ns | 160 ns | -870 ns | -70 | 5
2 | 512 | 1.9 µs | 520 ns | 510 ns | -1.4 µs | -73 | 3
2 | 1024 | 3.9 µs | 1 µs | 1.2 µs | -2.9 µs | -74 | 2
2 | 2048 | 7.8 µs | 2.1 µs | 2.5 µs | -5.7 µs | -74 | 2
2 | 4096 | 17 µs | 5.3 µs | 2 µs | -12 µs | -69 | 6
2 | 8192 | 30 µs | 7.7 µs | 8.5 µs | -22 µs | -74 | 3
2 | 16384 | 35 µs | 7.3 µs | 7.2 µs | -27 µs | -79 | 4
3 | 3 | 52 ns | 88 ns | 16 ns | 36 ns | 69 | 2
3 | 4 | 78 ns | 97 ns | 24 ns | 19 ns | 24 | 1
3 | 8 | 210 ns | 94 ns | 38 ns | -120 ns | -56 | 3
3 | 16 | 300 ns | 130 ns | 53 ns | -170 ns | -57 | 3
3 | 32 | 510 ns | 140 ns | 95 ns | -370 ns | -72 | 4
3 | 64 | 800 ns | 230 ns | 100 ns | -570 ns | -72 | 6
3 | 128 | 1.4 µs | 290 ns | 210 ns | -1.1 µs | -79 | 5
3 | 256 | 900 ns | 270 ns | 310 ns | -630 ns | -70 | 2
3 | 512 | 1.4 µs | 470 ns | 550 ns | -940 ns | -67 | 2
3 | 1024 | 2.2 µs | 510 ns | 460 ns | -1.7 µs | -77 | 4
3 | 2048 | 5.5 µs | 1.6 µs | 2.1 µs | -4 µs | -72 | 2
3 | 4096 | 13 µs | 3.8 µs | 3.7 µs | -9.3 µs | -71 | 3
3 | 8192 | 29 µs | 7.6 µs | 8.3 µs | -22 µs | -74 | 3
3 | 16384 | 43 µs | 11 µs | 17 µs | -32 µs | -74 | 2
4 | 4 | 85 ns | 92 ns | 20 ns | 6.6 ns | 8 | 0
4 | 8 | 200 ns | 89 ns | 37 ns | -110 ns | -56 | 3
4 | 16 | 320 ns | 120 ns | 61 ns | -200 ns | -61 | 3
4 | 32 | 470 ns | 140 ns | 88 ns | -320 ns | -69 | 4
4 | 64 | 800 ns | 220 ns | 120 ns | -580 ns | -72 | 5
4 | 128 | 1.5 µs | 330 ns | 250 ns | -1.2 µs | -78 | 5
4 | 256 | 890 ns | 340 ns | 310 ns | -550 ns | -61 | 2
4 | 512 | 1.2 µs | 460 ns | 330 ns | -730 ns | -62 | 2
4 | 1024 | 2.5 µs | 790 ns | 860 ns | -1.7 µs | -69 | 2
4 | 2048 | 3.5 µs | 950 ns | 400 ns | -2.6 µs | -73 | 7
4 | 4096 | 13 µs | 4 µs | 3.8 µs | -9 µs | -69 | 2
4 | 8192 | 30 µs | 7.7 µs | 8.2 µs | -22 µs | -74 | 3
4 | 16384 | 61 µs | 20 µs | 7.1 µs | -42 µs | -68 | 6
8 | 8 | 200 ns | 94 ns | 38 ns | -110 ns | -53 | 3
8 | 16 | 330 ns | 110 ns | 63 ns | -210 ns | -66 | 3
8 | 32 | 480 ns | 140 ns | 86 ns | -340 ns | -71 | 4
8 | 64 | 800 ns | 210 ns | 49 ns | -590 ns | -74 | 12
8 | 128 | 1.6 µs | 320 ns | 190 ns | -1.2 µs | -79 | 7
8 | 256 | 1.6 µs | 460 ns | 290 ns | -1.1 µs | -71 | 4
8 | 512 | 1.9 µs | 570 ns | 550 ns | -1.3 µs | -70 | 2
8 | 1024 | 4 µs | 1 µs | 1.1 µs | -3 µs | -75 | 3
8 | 2048 | 6.6 µs | 1.4 µs | 2.4 µs | -5.2 µs | -78 | 2
8 | 4096 | 19 µs | 5.2 µs | 870 ns | -14 µs | -73 | 16
8 | 8192 | 22 µs | 5.8 µs | 8.2 µs | -16 µs | -73 | 2
8 | 16384 | 50 µs | 11 µs | 14 µs | -39 µs | -78 | 3
16 | 16 | 310 ns | 130 ns | 57 ns | -180 ns | -59 | 3
16 | 32 | 460 ns | 160 ns | 63 ns | -310 ns | -66 | 5
16 | 64 | 820 ns | 210 ns | 130 ns | -610 ns | -74 | 5
16 | 128 | 1.4 µs | 310 ns | 180 ns | -1.1 µs | -78 | 6
16 | 256 | 2.8 µs | 510 ns | 150 ns | -2.3 µs | -82 | 15
16 | 512 | 1.2 µs | 320 ns | 340 ns | -840 ns | -72 | 2
16 | 1024 | 4.6 µs | 1.2 µs | 980 ns | -3.4 µs | -73 | 3
16 | 2048 | 7.7 µs | 1.9 µs | 2.4 µs | -5.8 µs | -75 | 2
16 | 4096 | 15 µs | 4.1 µs | 4.3 µs | -11 µs | -72 | 3
16 | 8192 | 14 µs | 3.6 µs | 480 ns | -10 µs | -74 | 21
16 | 16384 | 28 µs | 6.9 µs | 430 ns | -21 µs | -75 | 48
32 | 32 | 570 ns | 150 ns | 120 ns | -420 ns | -74 | 4
32 | 64 | 860 ns | 220 ns | 120 ns | -640 ns | -74 | 5
32 | 128 | 1.4 µs | 350 ns | 160 ns | -1.1 µs | -76 | 7
32 | 256 | 2.9 µs | 530 ns | 420 ns | -2.4 µs | -82 | 6
32 | 512 | 2.3 µs | 750 ns | 410 ns | -1.6 µs | -67 | 4
32 | 1024 | 3.7 µs | 1 µs | 1 µs | -2.7 µs | -73 | 3
32 | 2048 | 5.4 µs | 1.6 µs | 2.2 µs | -3.8 µs | -70 | 2
32 | 4096 | 11 µs | 1.8 µs | 1.7 µs | -9.2 µs | -83 | 5
32 | 8192 | 25 µs | 6.1 µs | 7.5 µs | -19 µs | -76 | 3
32 | 16384 | 59 µs | 15 µs | 17 µs | -44 µs | -74 | 3
64 | 64 | 830 ns | 230 ns | 150 ns | -610 ns | -73 | 4
64 | 128 | 1.4 µs | 330 ns | 100 ns | -1.1 µs | -77 | 11
64 | 256 | 2.9 µs | 520 ns | 170 ns | -2.3 µs | -82 | 14
64 | 512 | 1.7 µs | 540 ns | 530 ns | -1.2 µs | -69 | 2
64 | 1024 | 2.7 µs | 770 ns | 580 ns | -2 µs | -72 | 3
64 | 2048 | 4.3 µs | 1 µs | 950 ns | -3.3 µs | -77 | 3
64 | 4096 | 17 µs | 3.8 µs | 2.5 µs | -13 µs | -77 | 5
64 | 8192 | 38 µs | 9.9 µs | 1.3 µs | -28 µs | -74 | 21
64 | 16384 | 66 µs | 15 µs | 10 µs | -51 µs | -77 | 5
128 | 128 | 1.6 µs | 380 ns | 180 ns | -1.2 µs | -76 | 7
128 | 256 | 2.6 µs | 530 ns | 100 ns | -2.1 µs | -80 | 20
128 | 512 | 1.2 µs | 380 ns | 300 ns | -840 ns | -69 | 3
128 | 1024 | 5.3 µs | 1.3 µs | 1 µs | -4 µs | -75 | 4
128 | 2048 | 5.5 µs | 980 ns | 960 ns | -4.5 µs | -82 | 5
128 | 4096 | 13 µs | 2.8 µs | 3.9 µs | -10 µs | -78 | 3
128 | 8192 | 18 µs | 5.9 µs | 5.2 µs | -12 µs | -68 | 2
128 | 16384 | 59 µs | 15 µs | 9.3 µs | -44 µs | -75 | 5
256 | 256 | 3.3 µs | 590 ns | 540 ns | -2.7 µs | -82 | 5
256 | 512 | 1.2 µs | 410 ns | 340 ns | -830 ns | -67 | 2
256 | 1024 | 2.9 µs | 810 ns | 1.1 µs | -2.1 µs | -72 | 2
256 | 2048 | 9.1 µs | 2.4 µs | 1.7 µs | -6.7 µs | -74 | 4
256 | 4096 | 13 µs | 3.8 µs | 3.7 µs | -9.4 µs | -71 | 2
256 | 8192 | 30 µs | 8 µs | 4.7 µs | -22 µs | -73 | 5
256 | 16384 | 43 µs | 11 µs | 17 µs | -32 µs | -74 | 2
512 | 512 | 6.6 µs | 1 µs | 790 ns | -5.6 µs | -84 | 7
512 | 1024 | 4.4 µs | 980 ns | 1.4 µs | -3.4 µs | -78 | 2
512 | 2048 | 9.6 µs | 2 µs | 2.2 µs | -7.6 µs | -79 | 4
512 | 4096 | 9.3 µs | 1.9 µs | 2.1 µs | -7.4 µs | -79 | 3
512 | 8192 | 27 µs | 7.9 µs | 7.8 µs | -19 µs | -70 | 2
512 | 16384 | 60 µs | 15 µs | 17 µs | -44 µs | -74 | 3
1024 | 1024 | 12 µs | 2 µs | 960 ns | -9.9 µs | -83 | 10
1024 | 2048 | 4.7 µs | 1.5 µs | 1.2 µs | -3.2 µs | -67 | 3
1024 | 4096 | 11 µs | 3.1 µs | 4.6 µs | -8.2 µs | -73 | 2
1024 | 8192 | 22 µs | 5.8 µs | 8.8 µs | -17 µs | -74 | 2
1024 | 16384 | 35 µs | 7.4 µs | 7.5 µs | -28 µs | -79 | 4
2048 | 2048 | 24 µs | 3.8 µs | 1.5 µs | -20 µs | -84 | 13
2048 | 4096 | 17 µs | 4.1 µs | 5 µs | -13 µs | -75 | 3
2048 | 8192 | 27 µs | 7.9 µs | 8 µs | -19 µs | -71 | 2
2048 | 16384 | 45 µs | 12 µs | 18 µs | -33 µs | -74 | 2
4096 | 4096 | 51 µs | 8.3 µs | 1.7 µs | -42 µs | -84 | 24
4096 | 8192 | 28 µs | 7.5 µs | 8.6 µs | -21 µs | -73 | 2
4096 | 16384 | 28 µs | 8.4 µs | 1.1 µs | -19 µs | -70 | 17
8192 | 8192 | 80 µs | 14 µs | 1.3 µs | -66 µs | -82 | 49
8192 | 16384 | 66 µs | 16 µs | 20 µs | -50 µs | -76 | 3
16384 | 16384 | 200 µs | 30 µs | 2.4 µs | -170 µs | -85 | 71
(136 rows)

Since microbenchmark results are not normally distributed,
the sigmas and stddev columns unfortunately don't say much;
they are just an attempt to give some indication of variance.
Ideas for better indicators would be appreciated.

Here is the same report for the Intel Core i9-14900K:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+------------+----------+--------
1 | 1 | 72 ns | 72 ns | 6.8 ns | -1.9e-10 s | 0 | 0
1 | 2 | 76 ns | 77 ns | 8.5 ns | 1.1 ns | 1 | 0
1 | 3 | 87 ns | 120 ns | 10 ns | 38 ns | 43 | 4
1 | 4 | 98 ns | 120 ns | 12 ns | 27 ns | 27 | 2
1 | 8 | 340 ns | 130 ns | 19 ns | -200 ns | -60 | 11
1 | 16 | 500 ns | 160 ns | 34 ns | -350 ns | -69 | 10
1 | 32 | 850 ns | 200 ns | 81 ns | -660 ns | -77 | 8
1 | 64 | 1.6 µs | 300 ns | 130 ns | -1.3 µs | -82 | 11
1 | 128 | 3.2 µs | 520 ns | 180 ns | -2.7 µs | -84 | 15
1 | 256 | 2.2 µs | 570 ns | 360 ns | -1.6 µs | -74 | 5
1 | 512 | 4.4 µs | 1.1 µs | 1.2 µs | -3.3 µs | -75 | 3
1 | 1024 | 4.9 µs | 1 µs | 1 µs | -3.9 µs | -79 | 4
1 | 2048 | 15 µs | 4.2 µs | 4.1 µs | -11 µs | -72 | 3
1 | 4096 | 42 µs | 11 µs | 3.3 µs | -31 µs | -75 | 10
1 | 8192 | 86 µs | 22 µs | 4.5 µs | -65 µs | -75 | 15
1 | 16384 | 65 µs | 17 µs | 4 µs | -47 µs | -73 | 12
2 | 2 | 78 ns | 83 ns | 8.2 ns | 5.1 ns | 7 | 1
2 | 3 | 89 ns | 120 ns | 12 ns | 29 ns | 33 | 3
2 | 4 | 97 ns | 130 ns | 13 ns | 29 ns | 30 | 2
2 | 8 | 310 ns | 130 ns | 30 ns | -180 ns | -58 | 6
2 | 16 | 520 ns | 160 ns | 34 ns | -360 ns | -69 | 11
2 | 32 | 980 ns | 210 ns | 14 ns | -780 ns | -79 | 57
2 | 64 | 1.6 µs | 300 ns | 150 ns | -1.3 µs | -81 | 9
2 | 128 | 3.1 µs | 530 ns | 270 ns | -2.6 µs | -83 | 10
2 | 256 | 2.8 µs | 710 ns | 150 ns | -2.1 µs | -74 | 14
2 | 512 | 4.2 µs | 1.1 µs | 720 ns | -3.1 µs | -73 | 4
2 | 1024 | 6.7 µs | 2.2 µs | 1.5 µs | -4.5 µs | -67 | 3
2 | 2048 | 7.7 µs | 2 µs | 690 ns | -5.7 µs | -74 | 8
2 | 4096 | 20 µs | 4 µs | 4 µs | -16 µs | -81 | 4
2 | 8192 | 65 µs | 13 µs | 12 µs | -52 µs | -80 | 4
2 | 16384 | 100 µs | 26 µs | 39 µs | -78 µs | -75 | 2
3 | 3 | 87 ns | 130 ns | 10 ns | 39 ns | 45 | 4
3 | 4 | 100 ns | 130 ns | 13 ns | 27 ns | 27 | 2
3 | 8 | 330 ns | 120 ns | 21 ns | -210 ns | -63 | 10
3 | 16 | 470 ns | 160 ns | 38 ns | -310 ns | -66 | 8
3 | 32 | 910 ns | 190 ns | 72 ns | -720 ns | -79 | 10
3 | 64 | 1.7 µs | 300 ns | 130 ns | -1.4 µs | -83 | 11
3 | 128 | 3.2 µs | 510 ns | 250 ns | -2.7 µs | -84 | 11
3 | 256 | 1.9 µs | 570 ns | 530 ns | -1.4 µs | -71 | 3
3 | 512 | 3.3 µs | 770 ns | 1.2 µs | -2.5 µs | -77 | 2
3 | 1024 | 7.6 µs | 2 µs | 2.2 µs | -5.6 µs | -73 | 3
3 | 2048 | 12 µs | 3 µs | 4.8 µs | -9.4 µs | -76 | 2
3 | 4096 | 30 µs | 8.4 µs | 8.7 µs | -21 µs | -72 | 2
3 | 8192 | 68 µs | 17 µs | 19 µs | -51 µs | -75 | 3
3 | 16384 | 100 µs | 26 µs | 39 µs | -76 µs | -75 | 2
4 | 4 | 100 ns | 130 ns | 13 ns | 28 ns | 28 | 2
4 | 8 | 300 ns | 130 ns | 33 ns | -170 ns | -56 | 5
4 | 16 | 510 ns | 160 ns | 35 ns | -350 ns | -69 | 10
4 | 32 | 910 ns | 210 ns | 77 ns | -700 ns | -77 | 9
4 | 64 | 1.7 µs | 310 ns | 97 ns | -1.4 µs | -82 | 14
4 | 128 | 3.1 µs | 490 ns | 270 ns | -2.6 µs | -84 | 10
4 | 256 | 1.9 µs | 440 ns | 530 ns | -1.4 µs | -77 | 3
4 | 512 | 2.8 µs | 800 ns | 730 ns | -2 µs | -71 | 3
4 | 1024 | 9.4 µs | 2.1 µs | 1.6 µs | -7.3 µs | -77 | 5
4 | 2048 | 13 µs | 3 µs | 4.9 µs | -9.7 µs | -76 | 2
4 | 4096 | 34 µs | 8.4 µs | 9.9 µs | -26 µs | -75 | 3
4 | 8192 | 61 µs | 17 µs | 17 µs | -44 µs | -73 | 3
4 | 16384 | 120 µs | 26 µs | 34 µs | -89 µs | -77 | 3
8 | 8 | 310 ns | 120 ns | 33 ns | -180 ns | -59 | 6
8 | 16 | 540 ns | 170 ns | 30 ns | -360 ns | -68 | 12
8 | 32 | 930 ns | 200 ns | 75 ns | -730 ns | -79 | 10
8 | 64 | 1.7 µs | 310 ns | 120 ns | -1.4 µs | -82 | 12
8 | 128 | 3.3 µs | 510 ns | 270 ns | -2.7 µs | -84 | 10
8 | 256 | 4.7 µs | 880 ns | 330 ns | -3.8 µs | -81 | 11
8 | 512 | 3.7 µs | 800 ns | 1 µs | -2.9 µs | -79 | 3
8 | 1024 | 6.3 µs | 1.5 µs | 2.5 µs | -4.8 µs | -76 | 2
8 | 2048 | 14 µs | 3 µs | 4.2 µs | -11 µs | -79 | 3
8 | 4096 | 29 µs | 6.2 µs | 8.6 µs | -23 µs | -79 | 3
8 | 8192 | 32 µs | 8.3 µs | 2 µs | -24 µs | -74 | 12
8 | 16384 | 120 µs | 25 µs | 34 µs | -94 µs | -79 | 3
16 | 16 | 490 ns | 160 ns | 55 ns | -330 ns | -67 | 6
16 | 32 | 1 µs | 200 ns | 38 ns | -810 ns | -80 | 21
16 | 64 | 1.6 µs | 310 ns | 130 ns | -1.3 µs | -81 | 10
16 | 128 | 3.2 µs | 530 ns | 250 ns | -2.7 µs | -83 | 11
16 | 256 | 6.4 µs | 950 ns | 440 ns | -5.5 µs | -85 | 13
16 | 512 | 3.1 µs | 780 ns | 1.2 µs | -2.3 µs | -74 | 2
16 | 1024 | 5.2 µs | 1.5 µs | 1.4 µs | -3.8 µs | -72 | 3
16 | 2048 | 22 µs | 5.2 µs | 1.1 µs | -16 µs | -76 | 15
16 | 4096 | 29 µs | 8.4 µs | 8.1 µs | -21 µs | -71 | 3
16 | 8192 | 76 µs | 17 µs | 12 µs | -59 µs | -78 | 5
16 | 16384 | 120 µs | 26 µs | 12 µs | -90 µs | -77 | 8
32 | 32 | 970 ns | 220 ns | 98 ns | -740 ns | -77 | 8
32 | 64 | 1.7 µs | 310 ns | 140 ns | -1.3 µs | -81 | 10
32 | 128 | 3.3 µs | 520 ns | 280 ns | -2.8 µs | -84 | 10
32 | 256 | 6.9 µs | 980 ns | 250 ns | -5.9 µs | -86 | 23
32 | 512 | 3.7 µs | 790 ns | 1.1 µs | -2.9 µs | -79 | 3
32 | 1024 | 6.2 µs | 1.6 µs | 2.5 µs | -4.7 µs | -75 | 2
32 | 2048 | 22 µs | 5.3 µs | 840 ns | -17 µs | -76 | 20
32 | 4096 | 20 µs | 4 µs | 4.3 µs | -16 µs | -80 | 4
32 | 8192 | 58 µs | 13 µs | 17 µs | -45 µs | -78 | 3
32 | 16384 | 140 µs | 34 µs | 19 µs | -100 µs | -75 | 5
64 | 64 | 2 µs | 340 ns | 74 ns | -1.6 µs | -83 | 22
64 | 128 | 3.6 µs | 550 ns | 190 ns | -3.1 µs | -85 | 17
64 | 256 | 6.1 µs | 930 ns | 600 ns | -5.2 µs | -85 | 9
64 | 512 | 3.3 µs | 790 ns | 1.3 µs | -2.5 µs | -76 | 2
64 | 1024 | 9.5 µs | 2 µs | 1.5 µs | -7.4 µs | -79 | 5
64 | 2048 | 13 µs | 3.1 µs | 4.9 µs | -9.4 µs | -75 | 2
64 | 4096 | 39 µs | 11 µs | 4.3 µs | -29 µs | -73 | 7
64 | 8192 | 50 µs | 12 µs | 19 µs | -38 µs | -75 | 2
64 | 16384 | 64 µs | 17 µs | 4.5 µs | -47 µs | -73 | 10
128 | 128 | 3 µs | 540 ns | 210 ns | -2.5 µs | -82 | 12
128 | 256 | 6.9 µs | 1 µs | 390 ns | -5.9 µs | -85 | 15
128 | 512 | 2.7 µs | 810 ns | 730 ns | -1.9 µs | -70 | 3
128 | 1024 | 5.1 µs | 1.6 µs | 1.5 µs | -3.5 µs | -68 | 2
128 | 2048 | 20 µs | 5.2 µs | 2.4 µs | -15 µs | -74 | 6
128 | 4096 | 26 µs | 6.1 µs | 9.8 µs | -19 µs | -76 | 2
128 | 8192 | 60 µs | 17 µs | 17 µs | -44 µs | -72 | 3
128 | 16384 | 100 µs | 26 µs | 39 µs | -77 µs | -74 | 2
256 | 256 | 5.9 µs | 1 µs | 370 ns | -4.9 µs | -83 | 13
256 | 512 | 4.2 µs | 850 ns | 1.2 µs | -3.4 µs | -80 | 3
256 | 1024 | 12 µs | 2.6 µs | 180 ns | -9.3 µs | -78 | 51
256 | 2048 | 20 µs | 5 µs | 2.4 µs | -15 µs | -75 | 6
256 | 4096 | 30 µs | 6.3 µs | 8.8 µs | -24 µs | -79 | 3
256 | 8192 | 69 µs | 17 µs | 19 µs | -52 µs | -75 | 3
256 | 16384 | 150 µs | 35 µs | 23 µs | -120 µs | -78 | 5
512 | 512 | 14 µs | 2.1 µs | 1.1 µs | -12 µs | -85 | 11
512 | 1024 | 9.7 µs | 1.8 µs | 1.4 µs | -7.9 µs | -82 | 6
512 | 2048 | 10 µs | 2.1 µs | 2.4 µs | -8.3 µs | -79 | 3
512 | 4096 | 20 µs | 4.2 µs | 4.5 µs | -16 µs | -79 | 4
512 | 8192 | 88 µs | 21 µs | 4.6 µs | -67 µs | -76 | 15
512 | 16384 | 140 µs | 34 µs | 38 µs | -100 µs | -76 | 3
1024 | 1024 | 25 µs | 4 µs | 2.7 µs | -21 µs | -84 | 8
1024 | 2048 | 16 µs | 4.2 µs | 5.2 µs | -12 µs | -74 | 2
1024 | 4096 | 40 µs | 8.1 µs | 6.7 µs | -32 µs | -80 | 5
1024 | 8192 | 80 µs | 17 µs | 12 µs | -63 µs | -79 | 5
1024 | 16384 | 120 µs | 35 µs | 35 µs | -89 µs | -72 | 3
2048 | 2048 | 55 µs | 8 µs | 5.2 µs | -46 µs | -85 | 9
2048 | 4096 | 28 µs | 6.2 µs | 6 µs | -22 µs | -78 | 4
2048 | 8192 | 53 µs | 13 µs | 21 µs | -40 µs | -76 | 2
2048 | 16384 | 100 µs | 26 µs | 40 µs | -76 µs | -74 | 2
4096 | 4096 | 110 µs | 16 µs | 8.8 µs | -95 µs | -86 | 11
4096 | 8192 | 43 µs | 13 µs | 12 µs | -30 µs | -69 | 2
4096 | 16384 | 150 µs | 36 µs | 42 µs | -110 µs | -76 | 3
8192 | 8192 | 210 µs | 32 µs | 22 µs | -180 µs | -85 | 8
8192 | 16384 | 160 µs | 34 µs | 47 µs | -120 µs | -79 | 3
16384 | 16384 | 370 µs | 61 µs | 23 µs | -310 µs | -84 | 14
(136 rows)

Some of these appear to be slower, but I'm not sure they actually are.
It might just be noise, since the sigmas are quite low and, as noted above,
the sigmas column isn't very scientific because the data can't be assumed
to be normally distributed:

AMD Ryzen 9 7950X3D:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' and rel_diff > 0 order by 1,2;
 var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+-------+-------+---------------+----------+----------+--------
           1 |           3 | 55 ns | 89 ns | 18 ns         | 33 ns    |       60 |      2
           1 |           4 | 76 ns | 93 ns | 20 ns         | 16 ns    |       22 |      1
           2 |           3 | 54 ns | 93 ns | 16 ns         | 39 ns    |       73 |      2
           2 |           4 | 86 ns | 97 ns | 19 ns         | 10 ns    |       12 |      1
           3 |           3 | 52 ns | 88 ns | 16 ns         | 36 ns    |       69 |      2
           3 |           4 | 78 ns | 97 ns | 24 ns         | 19 ns    |       24 |      1
           4 |           4 | 85 ns | 92 ns | 20 ns         | 6.6 ns   |        8 |      0
(7 rows)

Intel Core i9-14900K:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' and rel_diff > 0 order by 1,2;
 var1ndigits | var2ndigits | a_avg  | b_avg  | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+----------+----------+--------
           1 |           2 | 76 ns  | 77 ns  | 8.5 ns        | 1.1 ns   |        1 |      0
           1 |           3 | 87 ns  | 120 ns | 10 ns         | 38 ns    |       43 |      4
           1 |           4 | 98 ns  | 120 ns | 12 ns         | 27 ns    |       27 |      2
           2 |           2 | 78 ns  | 83 ns  | 8.2 ns        | 5.1 ns   |        7 |      1
           2 |           3 | 89 ns  | 120 ns | 12 ns         | 29 ns    |       33 |      3
           2 |           4 | 97 ns  | 130 ns | 13 ns         | 29 ns    |       30 |      2
           3 |           3 | 87 ns  | 130 ns | 10 ns         | 39 ns    |       45 |      4
           3 |           4 | 100 ns | 130 ns | 13 ns         | 27 ns    |       27 |      2
           4 |           4 | 100 ns | 130 ns | 13 ns         | 28 ns    |       28 |      2
(9 rows)

Much the same (var1ndigits,var2ndigits) pairs seem slower on both
CPUs, so maybe it actually is a slowdown.
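
To make the overlap concrete, here is a small Python sketch that compares
the pairs with rel_diff > 0 from the two reports above. The pairs are copied
from those tables; the variable names are only illustrative.

```python
# (var1ndigits, var2ndigits) pairs with rel_diff > 0, copied from the reports above.
amd_slower = {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4), (4, 4)}
intel_slower = {(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4),
                (3, 3), (3, 4), (4, 4)}

# Set operations show which pairs regress on both CPUs and which on only one.
both = sorted(amd_slower & intel_slower)
amd_only = sorted(amd_slower - intel_slower)
intel_only = sorted(intel_slower - amd_slower)

print("slower on both:", both)
print("AMD only:", amd_only)
print("Intel only:", intel_only)
```

All seven AMD pairs also show up in the Intel list. The two Intel-only pairs,
(1,2) and (2,2), have the smallest rel_diff values in that report (1 and 7),
and every pair in either list has var2ndigits of at most 4.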

These benchmark results were obtained by comparing
numeric_div() on HEAD (ff59d5d) against HEAD with
v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
applied.

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

This is how many microseconds 512 iterations took
when comparing HEAD vs v1-0001:

AMD Ryzen 9 7950X3D:
{21,31,31,21,21,21,32,34,21,21,22,35,36,35,39,21,35,21,21,30,21,21,21,20,31,36,22,22,37,33} -- HEAD (ff59d5d)
{49,60,32,56,48,55,48,48,32,32,48,48,32,63,48,49,49,56,48,49,33,47,32,47,33,55,55,56,33,32} -- v1-0001

Intel Core i9-14900K:
{45,36,46,45,49,45,46,49,46,46,46,45,49,49,45,33,46,46,36,49,45,49,46,45,46,49,45,49,46,45} -- HEAD (ff59d5d)
{70,63,70,70,70,51,63,45,69,69,63,70,70,69,70,62,69,70,63,70,69,69,63,69,64,70,69,63,64,51} -- v1-0001

n=30 (3 random vars * 10 measurements)

(The Intel is slower than the AMD because it is running at a fixed CPU frequency.)
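
One way to compare the two sets of measurements without assuming normality
is to look at medians and at a rank-based effect size: the fraction of
(HEAD, v1-0001) pairs in which the patched run took longer, counting ties
as half. The following is only a sketch using the measurements listed above;
nothing here is taken from pg-catbench, the function and variable names are
made up, and the per-call conversion assumes each figure is the total time
for 512 calls.

```python
from statistics import median

ITERATIONS = 512  # each measurement is microseconds per 512 iterations

# (HEAD, v1-0001) measurements copied from the lists above.
samples = {
    "AMD Ryzen 9 7950X3D": (
        [21, 31, 31, 21, 21, 21, 32, 34, 21, 21, 22, 35, 36, 35, 39,
         21, 35, 21, 21, 30, 21, 21, 21, 20, 31, 36, 22, 22, 37, 33],
        [49, 60, 32, 56, 48, 55, 48, 48, 32, 32, 48, 48, 32, 63, 48,
         49, 49, 56, 48, 49, 33, 47, 32, 47, 33, 55, 55, 56, 33, 32],
    ),
    "Intel Core i9-14900K": (
        [45, 36, 46, 45, 49, 45, 46, 49, 46, 46, 46, 45, 49, 49, 45,
         33, 46, 46, 36, 49, 45, 49, 46, 45, 46, 49, 45, 49, 46, 45],
        [70, 63, 70, 70, 70, 51, 63, 45, 69, 69, 63, 70, 70, 69, 70,
         62, 69, 70, 63, 70, 69, 69, 63, 69, 64, 70, 69, 63, 64, 51],
    ),
}


def prob_slower(head, patched):
    """Fraction of (head, patched) pairs where patched took longer; ties count half."""
    wins = sum(p > h for h in head for p in patched)
    ties = sum(p == h for h in head for p in patched)
    return (wins + 0.5 * ties) / (len(head) * len(patched))


def ns_per_call(total_us):
    """Convert microseconds per ITERATIONS calls to nanoseconds per call."""
    return total_us * 1000 / ITERATIONS


results = {}
for cpu, (head, patched) in samples.items():
    results[cpu] = (median(head), median(patched), prob_slower(head, patched))
    m_head, m_patched, p = results[cpu]
    print(f"{cpu}: median {m_head} -> {m_patched} us per {ITERATIONS} calls "
          f"({ns_per_call(m_head):.0f} -> {ns_per_call(m_patched):.0f} ns/call), "
          f"P(v1-0001 slower) = {p:.3f}")
```

For these samples the medians are 22 vs 48 µs on the AMD and 46 vs 69 µs on
the Intel, and the patched run is the slower one in about 91% and 97.5% of
pairs respectively. A value near 0.5 would mean no systematic difference;
this is the same quantity as the Mann-Whitney U statistic divided by the
number of pairs, so it depends only on ranks and not on the shape of the
distribution.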

Regards,
Joel
