Re: Optimize mul_var() for var1ndigits >= 8

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Date: 2024-07-29 20:38:45
Message-ID: e9bd7e7d-5858-4aa5-a39c-9b6a2529b64c@app.fastmail.com
Lists: pgsql-hackers

On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Yes, I agree. I just thought it was noteworthy, not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> ----------+----------+---------------+---------------+----------
>        15 |       15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
>        16 |       16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
>        17 |       17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
>
>        23 |       23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
>        24 |       24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
>        25 |       25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
>
>        31 |       31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
>        32 |       32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
>        33 |       33 |   2.27133e+06 |  3.435459e+06 | +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.
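
To make the "whole number of XMM register operations" point concrete, here is a rough sketch of the arithmetic. It is not taken from the patch. It assumes, as the observation above suggests, that the vectorized loop handles eight digits per iteration; `LANES_PER_XMM` and both helper names are made up for illustration, and the code a compiler actually generates may differ.

```c
/* Assumption: eight 16-bit digits are processed per 128-bit XMM iteration. */
#define LANES_PER_XMM 8

/* Number of full vector iterations for a run of ndigits digits. */
static int
full_vector_iterations(int ndigits)
{
    return ndigits / LANES_PER_XMM;
}

/* Digits left over after the full vector iterations (the scalar tail). */
static int
leftover_digits(int ndigits)
{
    return ndigits % LANES_PER_XMM;
}
```

Under that assumption, 16 digits split into two full iterations with no tail, while 15 and 17 digits leave tails of 7 and 1 digits respectively, which would be consistent with 16x16 looking disproportionately fast in the old code.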

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
> 16383 | 16383 | 21.629467 | 73.58552 | +240.21%
>
> clang 15.0.7:
>
> 16383 | 16383 | 11.562384 | 73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.
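
For anyone who wants to try this on godbolt.org, a simplified stand-in for the kind of loop being discussed might look like the sketch below. It is not the actual mul_var() code; the function name and signature are hypothetical. It multiplies one 16-bit digit by an array of 16-bit digits and accumulates the products into 32-bit slots. PMULLW and PMULHW yield the low and high 16 bits of a 16-bit x 16-bit product, whereas PMULUDQ performs 32-bit x 32-bit multiplications.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a 16-bit x 16-bit multiply-accumulate loop:
 * adds d1 * digits2[j] to dig[j] for each of the n digits.
 * Each product is at most 9999 * 9999, so it fits easily in 32 bits.
 */
static void
mul_accumulate(int32_t *dig, int16_t d1, const int16_t *digits2, size_t n)
{
    for (size_t j = 0; j < n; j++)
        dig[j] += (int32_t) d1 * digits2[j];
}
```

Compiling this with -O3 under gcc and clang should show whichever instruction selection each compiler prefers for this shape of loop, though the output will depend on compiler version and flags.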

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straightforward. The PRODSUM macros are nice;
they really improve readability a lot.

I like these simplifications, where `var2ndigits` is used instead of `res_ndigits`:
-    for (int i = res_ndigits - 3; i >= 1; i--)
+    for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder why `case 1:` does not follow the same pattern?
     for (int i = res_ndigits - 2; i >= 0; i--)
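
To show what following the same pattern could look like, here is a self-contained sketch of a one-digit by n-digit multiplication with the loop bound written in terms of `var2ndigits`. It is not the patch code. It assumes the result has `var2ndigits + 1` digits when var1 has a single digit, so that `res_ndigits - 2` and `var2ndigits - 1` are the same index; the function name and signature are hypothetical. Digits are base 10000 (as in PostgreSQL's numeric), most significant first.

```c
#include <stdint.h>

#define NBASE 10000     /* base-10000 digits */

/*
 * Multiply the var2ndigits-digit number var2 by the single digit d1.
 * res must have room for var2ndigits + 1 digits (assumed result length).
 */
static void
mul_by_one_digit(int16_t d1, const int16_t *var2, int var2ndigits,
                 int16_t *res)
{
    uint32_t    carry = 0;

    /* Work from the least significant digit, propagating the carry. */
    for (int i = var2ndigits - 1; i >= 0; i--)
    {
        uint32_t    term = (uint32_t) d1 * (uint32_t) var2[i] + carry;

        res[i + 1] = (int16_t) (term % NBASE);
        carry = term / NBASE;
    }
    res[0] = (int16_t) carry;   /* final carry becomes the top digit */
}
```

For example, 9999 times the two-digit number {9999, 9999} gives the three digits {9998, 9999, 1}, i.e. 999899990001.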

* v3-0002

I think it's not obvious whether the benefit of separate code paths for
32-bit and 64-bit, using `#if SIZEOF_DATUM < 8` to get *fast* 32-bit support,
outweighs the benefit of simpler code.

Earlier in this thread, you raised the question of whether 32-bit systems
should be regarded as legacy.

Unfortunately, we didn't get any feedback, so I'm starting a separate
thread with the subject "Is fast 32-bit code still important?", hoping to get
more input to help us make judgement calls.

/Joel
