Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Dagfinn Ilmari Mannsåker <ilmari(at)ilmari(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Date: 2024-07-05 16:42:45
Message-ID: 58e5e7d2-7ad8-40b4-9b76-a5c3049346e5@app.fastmail.com
Lists: pgsql-hackers

On Fri, Jul 5, 2024, at 17:41, Dean Rasheed wrote:
> On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> Interesting you got so bad bench results for v6-mul_var_int64.patch
>> for var1ndigits=4, that patch is actually the winner on AMD Ryzen 9 7950X3D.
>
> Interesting.

I remeasured just to be sure, and yes, it was the winner among the previous patches,
but the new v7 beats it.

>> On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
>
> That must be random noise, since
> v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
> invoke mul_var_small() for 4-digit inputs.

Yes, something was off with the HEAD measurements for that one.
I remeasured and got almost identical results (as expected)
between HEAD and v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
for 4-digit inputs.

>> On Apple M3 Max, HEAD is the winner.
>
> Importantly, mul_var_int64() is around 1.25x slower there, and it was
> even worse on my machine.
>
> Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
> this gives a nice speedup:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617) -- HEAD
> Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
> Time: 4750.212 ms (00:04.750) -- v7-mul_var_small.patch
>
> The other advantage, of course, is that it doesn't require 128-bit
> integer support.

Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3574.865 ms (00:03.575)
Time: 3573.678 ms (00:03.574)
Time: 3576.953 ms (00:03.577)
Time: 3580.536 ms (00:03.581)
Time: 3589.007 ms (00:03.589)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3110.171 ms (00:03.110)
Time: 3098.558 ms (00:03.099)
Time: 3105.873 ms (00:03.106)
Time: 3104.484 ms (00:03.104)
Time: 3109.035 ms (00:03.109)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3751.767 ms (00:03.752)
Time: 3745.916 ms (00:03.746)
Time: 3742.542 ms (00:03.743)
Time: 3746.139 ms (00:03.746)
Time: 3745.493 ms (00:03.745)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3747.640 ms (00:03.748)
Time: 3747.231 ms (00:03.747)
Time: 3747.965 ms (00:03.748)
Time: 3748.309 ms (00:03.748)
Time: 3746.498 ms (00:03.746)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3417.924 ms (00:03.418)
Time: 3417.088 ms (00:03.417)
Time: 3415.708 ms (00:03.416)
Time: 3415.453 ms (00:03.415)
Time: 3419.566 ms (00:03.420)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3970.131 ms (00:03.970)
Time: 3924.335 ms (00:03.924)
Time: 3927.863 ms (00:03.928)
Time: 3924.761 ms (00:03.925)
Time: 3926.290 ms (00:03.926)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3874.769 ms (00:03.875)
Time: 3858.071 ms (00:03.858)
Time: 3836.698 ms (00:03.837)
Time: 3871.388 ms (00:03.871)
Time: 3844.907 ms (00:03.845)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3397.846 ms (00:03.398)
Time: 3398.050 ms (00:03.398)
Time: 3395.279 ms (00:03.395)
Time: 3393.285 ms (00:03.393)
Time: 3402.570 ms (00:03.403)
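
To put the runs above into one number per CPU, here is a small sketch that averages the five HEAD timings and the five v7 timings and reports the ratio. The timings are copied from the output above. The dictionary layout and the `speedup` helper are only illustrative names, and a plain mean over five runs is an assumption about how to summarize them, not a rigorous statistical comparison.

```python
from statistics import mean

# Timings in milliseconds, copied from the benchmark output above.
# "v7" stands for v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
timings_ms = {
    "Apple M3 Max": {
        "HEAD": [3574.865, 3573.678, 3576.953, 3580.536, 3589.007],
        "v7": [3110.171, 3098.558, 3105.873, 3104.484, 3109.035],
    },
    "Intel Core i9-14900K": {
        "HEAD": [3751.767, 3745.916, 3742.542, 3746.139, 3745.493],
        "v7": [3417.924, 3417.088, 3415.708, 3415.453, 3419.566],
    },
    "AMD Ryzen 9 7950X3D": {
        "HEAD": [3970.131, 3924.335, 3927.863, 3924.761, 3926.290],
        "v7": [3397.846, 3398.050, 3395.279, 3393.285, 3402.570],
    },
}


def speedup(head_runs, patched_runs):
    """Ratio of mean HEAD time to mean patched time (values above 1 mean faster)."""
    return mean(head_runs) / mean(patched_runs)


speedups = {cpu: speedup(t["HEAD"], t["v7"]) for cpu, t in timings_ms.items()}

for cpu, ratio in speedups.items():
    print(f"{cpu}: v7 is {ratio:.3f}x faster than HEAD")
```

Note that these queries also include the cost of scanning the table and of SUM(), so the ratio understates the gain in the multiplication itself.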

Code-wise, I think it's now very nice and clear, with just enough comments.

Also nice to see that the var1ndigits=4 case isn't much more complex
than var1ndigits=3, since it follows the same pattern.
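
For readers who have not opened the patch, the general idea of multiplying a short var1 by an arbitrary-length var2 in base NBASE can be sketched as below. This is not the patch's code: the patch is C inside numeric.c, while this is a Python illustration with hypothetical names (`mul_small`, `to_int`), a generic inner loop, and no handling of weight, sign, or display scale. It only shows that each result digit is a sum of at most four digit products plus a carry, which is why the 4-digit case looks like the 3-digit case with one more term.

```python
NBASE = 10000  # PostgreSQL stores numeric values as base-10000 digits


def to_int(digits):
    """Interpret a most-significant-first list of base-NBASE digits as an int."""
    value = 0
    for d in digits:
        value = value * NBASE + d
    return value


def mul_small(var1, var2):
    """Multiply two digit lists (most significant first), var1 having 1..4 digits.

    Illustrative sketch only; names and structure are assumptions.
    """
    n1, n2 = len(var1), len(var2)
    assert 1 <= n1 <= 4 and n2 >= 1
    res = [0] * (n1 + n2)
    carry = 0
    # Walk the result from the least significant digit upward.
    for k in range(n1 + n2 - 2, -1, -1):
        term = carry
        # At most n1 (<= 4) products contribute to each result digit.
        for i in range(n1):
            j = k - i
            if 0 <= j < n2:
                term += var1[i] * var2[j]
        # Four products of digits below NBASE plus a carry stay below 2**32,
        # so an unsigned 32-bit accumulator would be enough here.
        assert term < 2**32
        res[k + 1] = term % NBASE
        carry = term // NBASE
    res[0] = carry
    return res


var1 = [9999, 9999, 9999, 9999]          # 4 base-NBASE digits
var2 = [1234, 5678, 9012, 3456, 7890]    # arbitrary length
product = mul_small(var1, var2)
print(to_int(product) == to_int(var1) * to_int(var2))
```

The bound checked inside the loop is also what makes this approach independent of 128-bit integer support, as Dean points out above.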

Regards,
Joel
