From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | Dagfinn Ilmari Mannsåker <ilmari(at)ilmari(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. |
Date: | 2024-07-05 16:42:45 |
Message-ID: | 58e5e7d2-7ad8-40b4-9b76-a5c3049346e5@app.fastmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 5, 2024, at 17:41, Dean Rasheed wrote:
> On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> Interesting you got so bad bench results for v6-mul_var_int64.patch
>> for var1ndigits=4, that patch is actually the winner on AMD Ryzen 9 7950X3D.
>
> Interesting.
I remeasured just to be sure, and yes, it was the winner among the previous patches,
but the new v7 beats it.
>> On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
>
> That must be random noise, since
> v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
> invoke mul_var_small() for 4-digit inputs.
Yes, something was off with the HEAD measurements for that one.
I remeasured and got almost identical results (as expected)
between HEAD and v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
for 4-digit inputs.
>> On Apple M3 Max, HEAD is the winner.
>
> Importantly, mul_var_int64() is around 1.25x slower there, and it was
> even worse on my machine.
>
> Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
> this gives a nice speedup:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617) -- HEAD
> Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
> Time: 4750.212 ms (00:04.750) -- v7-mul_var_small.patch
>
> The other advantage, of course, is that it doesn't require 128-bit
> integer support.
Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3574.865 ms (00:03.575)
Time: 3573.678 ms (00:03.574)
Time: 3576.953 ms (00:03.577)
Time: 3580.536 ms (00:03.581)
Time: 3589.007 ms (00:03.589)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3110.171 ms (00:03.110)
Time: 3098.558 ms (00:03.099)
Time: 3105.873 ms (00:03.106)
Time: 3104.484 ms (00:03.104)
Time: 3109.035 ms (00:03.109)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3751.767 ms (00:03.752)
Time: 3745.916 ms (00:03.746)
Time: 3742.542 ms (00:03.743)
Time: 3746.139 ms (00:03.746)
Time: 3745.493 ms (00:03.745)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3747.640 ms (00:03.748)
Time: 3747.231 ms (00:03.747)
Time: 3747.965 ms (00:03.748)
Time: 3748.309 ms (00:03.748)
Time: 3746.498 ms (00:03.746)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3417.924 ms (00:03.418)
Time: 3417.088 ms (00:03.417)
Time: 3415.708 ms (00:03.416)
Time: 3415.453 ms (00:03.415)
Time: 3419.566 ms (00:03.420)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3970.131 ms (00:03.970)
Time: 3924.335 ms (00:03.924)
Time: 3927.863 ms (00:03.928)
Time: 3924.761 ms (00:03.925)
Time: 3926.290 ms (00:03.926)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3874.769 ms (00:03.875)
Time: 3858.071 ms (00:03.858)
Time: 3836.698 ms (00:03.837)
Time: 3871.388 ms (00:03.871)
Time: 3844.907 ms (00:03.845)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3397.846 ms (00:03.398)
Time: 3398.050 ms (00:03.398)
Time: 3395.279 ms (00:03.395)
Time: 3393.285 ms (00:03.393)
Time: 3402.570 ms (00:03.403)
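To put one number on each machine, here is a minimal sketch (not part of any patch) that takes the median of the five runs and divides HEAD by v7. The timings are copied from the runs above; the array and function names (median_ms, speedup, m3_head, ...) are only made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Timings in ms, copied from the psql output above (five runs each). */
static const double m3_head[]    = {3574.865, 3573.678, 3576.953, 3580.536, 3589.007};
static const double m3_v7[]      = {3110.171, 3098.558, 3105.873, 3104.484, 3109.035};
static const double intel_head[] = {3751.767, 3745.916, 3742.542, 3746.139, 3745.493};
static const double intel_v7[]   = {3417.924, 3417.088, 3415.708, 3415.453, 3419.566};
static const double amd_head[]   = {3970.131, 3924.335, 3927.863, 3924.761, 3926.290};
static const double amd_v7[]     = {3397.846, 3398.050, 3395.279, 3393.285, 3402.570};

/* qsort comparator for doubles, ascending. */
static int
cmp_double(const void *x, const void *y)
{
    double a = *(const double *) x;
    double b = *(const double *) y;

    return (a > b) - (a < b);
}

/*
 * Median of n samples. Sorts a private copy, so the input is untouched.
 * Returns -1.0 if n <= 0 or the allocation fails.
 */
static double
median_ms(const double *v, int n)
{
    double *tmp;
    double  m;

    if (n <= 0)
        return -1.0;
    tmp = malloc((size_t) n * sizeof(double));
    if (tmp == NULL)
        return -1.0;
    memcpy(tmp, v, (size_t) n * sizeof(double));
    qsort(tmp, (size_t) n, sizeof(double), cmp_double);
    m = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return m;
}

/* Ratio of median HEAD time to median patched time (> 1 means faster). */
static double
speedup(const double *head, const double *patched, int n)
{
    return median_ms(head, n) / median_ms(patched, n);
}
```

With the numbers above, speedup(m3_head, m3_v7, 5) comes out at roughly 1.15, the Intel pair at roughly 1.10, and the AMD pair at roughly 1.16. I used the median rather than the mean so that a single slow run (like the first HEAD run on the AMD box) doesn't skew the ratio.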
Code-wise, I think it's now very nice and clear, with just enough comments.
It's also nice to see that the var1ndigits=4 case isn't much more complex
than var1ndigits=3, since it follows the same pattern.
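For readers who haven't opened the patch, the general idea can be sketched roughly as below. This is not the code from v7, just a simplified standalone illustration of schoolbook multiplication with a fixed 4-digit var1 in base NBASE (10000). The names mul_small4 and digits_to_u64 are made up here, weight, sign and rounding are ignored, and the loop is left generic rather than unrolled per digit count.

```c
#include <stdint.h>

#define NBASE 10000

/*
 * Multiply a (exactly 4 base-NBASE digits) by b (nb base-NBASE digits).
 * Digits are stored most significant first, each in the range 0..NBASE-1.
 * res must have room for nb + 4 digits.
 *
 * Each result digit is the sum of at most 4 digit products plus a carry:
 * 4 * 9999 * 9999 plus a carry of a few tens of thousands is about 4e8,
 * which fits in a uint32_t, so no 64-bit or 128-bit arithmetic is needed.
 */
static void
mul_small4(const int16_t a[4], const int16_t *b, int nb, int16_t *res)
{
    uint32_t carry = 0;

    /* k = i + j; walk from the least significant column to the most. */
    for (int k = nb + 2; k >= 0; k--)
    {
        uint32_t term = carry;

        for (int i = 0; i < 4; i++)
        {
            int j = k - i;

            if (j >= 0 && j < nb)
                term += (uint32_t) a[i] * (uint32_t) b[j];
        }
        res[k + 1] = (int16_t) (term % NBASE);
        carry = term / NBASE;
    }
    res[0] = (int16_t) carry;
}

/*
 * Helper for checking small cases only: converts digits to an integer.
 * Only valid when the value fits in 64 bits.
 */
static uint64_t
digits_to_u64(const int16_t *d, int n)
{
    uint64_t v = 0;

    for (int i = 0; i < n; i++)
        v = v * NBASE + (uint64_t) d[i];
    return v;
}
```

For example, with a = {0, 0, 1234, 5678} and b = {9, 8765}, the six digits written to res represent 12345678 * 98765. The 3-digit case would look the same with one fewer product per column, which is the sense in which the two cases follow the same pattern in this sketch.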
Regards,
Joel