Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Aaron Altman <aaronaltman(at)posteo(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Date: 2024-07-02 09:42:06
Message-ID: CAEZATCW9hzVVfoH8-OZL4+-q6PD-ztEkJdxuzC6kBY3f+RhJEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, 30 Jun 2024 at 11:22, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
>
> > But why just split it into two pieces? That will just lead
> > to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> > why not split the longer input into N roughly equal sized pieces, each
> > around the same length as the shorter input, multiplying and adding
> > them at the appropriate offsets?
>
> The approach you're describing is implemented by e.g. CPython
> and is called "lopsided" in their code base. It has some different
> performance characteristics, compared to the recursive Half-Karatsuba
> approach.
>
> What I didn't like about lopsided is the degenerate case where the
> last chunk is much shorter than the var1, for example, if we pretend
> we would be doing Karatsuba all the way down to ndigits 2,
> and think about the example var1ndigits = 3 and var2ndigits = 10,
> then lopsided would do
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=1
>
> whereas Half-Karatsuba would do
> var1ndigits=3 var2ndigits=5
> var1ndigits=3 var2ndigits=5
>
> You can find contrary examples too of course where lopsided
> is better than Half-Karatsuba, none of them seem substantially better
> than the other.

Actually, that wasn't quite what I was thinking of. The approach I've
seen (I can't remember where) is to split var2 into roughly
equal-sized chunks as follows:

If var2ndigits >= 2 * var1ndigits, split it into a number chunks where

nchunks = var2ndigits / var1ndigits

using integer division with truncation. Then call mul_var_chunks()
(say), passing it nchunks, to perform the multiplication using that
many chunks.

mul_var_chunks() would use nchunks to decides on the chunk size according to

chunk_size = var2ndigits / nchunks

again using integer division with truncation. That has a remainder in
the range [0, nchunks), which is divided up between the chunks so that
some of them end up being chunk_size + 1 digits. I.e., something like:

chunk_remainder = var2ndigits - chunk_size * nchunks

for (int start = 0, end = chunk_size;
start < var2ndigits;
start = end, end = start + chunk_size)
{
/* Distribute remainder over the first few chunks */
if (chunk_remainder > 0)
{
end++;
chunk_remainder--;
}

/* Process chunk between "start" and "end" */
}

What this does is adapt the shape of the chunks to the region where
the Karatsuba algorithm is most efficient. For example, suppose
var1ndigits = 1000. Then:

For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and
chunk_size will be at most (2999/2)+1 = 1500.

For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and
chunk_size will be at most (3999/3)+1 = 1334.

And so on.

The result is that all the chunks will fall into that region where
var2ndigits / var1ndigits is between 1 and 1.5, for which
KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and
Karatsuba will operate efficiently all the way down to
KARATSUBA_BASE_LIMIT.

Regards,
Dean

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-07-02 09:44:07 Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)
Previous Message vignesh C 2024-07-02 09:39:59 Re: Logical Replication of sequences