From: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Aaron Altman <aaronaltman(at)posteo(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Optimize numeric.c mul_var() using the Karatsuba algorithm |
Date: | 2024-07-02 09:42:06 |
Message-ID: | CAEZATCW9hzVVfoH8-OZL4+-q6PD-ztEkJdxuzC6kBY3f+RhJEA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
>
> > But why just split it into two pieces? That will just lead
> > to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> > why not split the longer input into N roughly equal sized pieces, each
> > around the same length as the shorter input, multiplying and adding
> > them at the appropriate offsets?
>
> The approach you're describing is implemented by e.g. CPython
> and is called "lopsided" in their code base. It has some different
> performance characteristics, compared to the recursive Half-Karatsuba
> approach.
>
> What I didn't like about lopsided is the degenerate case where the
> last chunk is much shorter than the var1, for example, if we pretend
> we would be doing Karatsuba all the way down to ndigits 2,
> and think about the example var1ndigits = 3 and var2ndigits = 10,
> then lopsided would do
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=1
>
> whereas Half-Karatsuba would do
> var1ndigits=3 var2ndigits=5
> var1ndigits=3 var2ndigits=5
>
> You can find contrary examples too of course where lopsided
> is better than Half-Karatsuba, none of them seem substantially better
> than the other.
Actually, that wasn't quite what I was thinking of. The approach I've
seen (I can't remember where) is to split var2 into roughly
equal-sized chunks as follows:
If var2ndigits >= 2 * var1ndigits, split it into a number chunks where
nchunks = var2ndigits / var1ndigits
using integer division with truncation. Then call mul_var_chunks()
(say), passing it nchunks, to perform the multiplication using that
many chunks.
mul_var_chunks() would use nchunks to decides on the chunk size according to
chunk_size = var2ndigits / nchunks
again using integer division with truncation. That has a remainder in
the range [0, nchunks), which is divided up between the chunks so that
some of them end up being chunk_size + 1 digits. I.e., something like:
chunk_remainder = var2ndigits - chunk_size * nchunks
for (int start = 0, end = chunk_size;
start < var2ndigits;
start = end, end = start + chunk_size)
{
/* Distribute remainder over the first few chunks */
if (chunk_remainder > 0)
{
end++;
chunk_remainder--;
}
/* Process chunk between "start" and "end" */
}
What this does is adapt the shape of the chunks to the region where
the Karatsuba algorithm is most efficient. For example, suppose
var1ndigits = 1000. Then:
For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and
chunk_size will be at most (2999/2)+1 = 1500.
For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and
chunk_size will be at most (3999/3)+1 = 1334.
And so on.
The result is that all the chunks will fall into that region where
var2ndigits / var1ndigits is between 1 and 1.5, for which
KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and
Karatsuba will operate efficiently all the way down to
KARATSUBA_BASE_LIMIT.
Regards,
Dean
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Gustafsson | 2024-07-02 09:44:07 | Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c) |
Previous Message | vignesh C | 2024-07-02 09:39:59 | Re: Logical Replication of sequences |