Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Joel Jacobson <joel(at)compiler(dot)org>, Aaron Altman <aaronaltman(at)posteo(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Date: 2024-06-29 12:22:09
Message-ID: CAEZATCXQnJO-GbV-8QPWsjRuBodoFDbbbinRKANFUazoFW=YBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
>

Something to watch out for is that not all callers of mul_var() want
an exact result. Several internal callers request an approximate
result by passing it an rscale value less than the sum of the input
dscales. The schoolbook algorithm handles that by computing up to
rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
the new Karatsuba code always computes the full result and then
rounds. That impacts the performance of various functions, for
example:

select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;

Time: 1790.825 ms (00:01.791) [HEAD]
Time: 2161.244 ms (00:02.161) [with patch]

Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula
seems to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the
schoolbook algorithm in two steps, by splitting the longer input into
two pieces. But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets? As an example, given inputs with
var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
mul_var_karatsuba_half(), which will then recursively invoke mul_var()
twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
algorithm on each half, which stands no chance of being any faster. On
the other hand, if it divided var2 into 10 chunks of length 1000, it
would invoke the Karatsuba algorithm on each chunk, which would at
least stand a chance of being faster.

Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the
other conditions in KARATSUBA_CONDITION() will pass either, so it'll
just invoke the schoolbook algorithm on each piece, which is bound to
be slower once all the overheads are taken into account. For example,
given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
with var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.

Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
so these tests could be simplified, eliminating some of those magic
constants.

However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort. I think
a better approach would be to have a configurable threshold. Ideally,
this would be just one number, with all other numbers being derived
from it, possibly using some simple heuristic to reduce the effective
threshold for more balanced inputs, for which the Karatsuba algorithm
is more efficient.

Having a configurable threshold would allow people to tune for best
performance on their own platforms, and also it would make it easier
to write tests that hit the new code. As it stands, it's not obvious
how much of the new code is being hit by the existing tests.

Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:

digits | rate (HEAD) | rate (patch) | change
--------+---------------+---------------+--------
10 | 6.060014e+06 | 6.0189365e+06 | -0.7%
100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
1000 | 88640.37 | 90504.82 | +2.1%
1500 | 39885.23 | 41041.504 | +2.9%
1600 | 36355.24 | 33368.28 | -8.2%
2000 | 23308.582 | 23105.932 | -0.9%
3000 | 10765.185 | 11360.11 | +5.5%
4000 | 6118.2554 | 6645.4116 | +8.6%
5000 | 3928.4985 | 4639.914 | +18.1%
10000 | 1003.80164 | 1431.9335 | +42.7%
20000 | 255.46135 | 456.23462 | +78.6%
30000 | 110.69313 | 226.53398 | +104.7%
40000 | 62.29333 | 148.12916 | +137.8%
50000 | 39.867493 | 95.16788 | +138.7%
60000 | 27.7672 | 74.01282 | +166.5%

The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so
presumably the variations below that are just noise, but this does
seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and
I'd probably want it to be something like 500-700.

There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs. That's not currently the case when computing approximate
results with a reduced rscale. That's fixable, but it's not obvious
whether or not the Karatsuba algorithm can actually be made beneficial
when computing such approximate results.

There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-06-29 15:25:30 Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Previous Message Alexander Lakhin 2024-06-29 12:00:00 Re: LogwrtResult contended spinlock