From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Optimize mul_var() for var1ndigits >= 8 |
Date: | 2024-07-29 20:38:45 |
Message-ID: | e9bd7e7d-5858-4aa5-a39c-9b6a2529b64c@app.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.
Yes, I agree, I just thought it was noteworthy, but not a problem per se.
> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
> ndigits1 | ndigits2 | PG17 rate | patch rate | % change
> ----------+----------+---------------+---------------+----------
> 15 | 15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
> 16 | 16 | 4.3353435e+06 | 4.970363e+06 | +14.65%
> 17 | 17 | 3.9258755e+06 | 4.935394e+06 | +25.71%
>
> 23 | 23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
> 24 | 24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
> 25 | 25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
>
> 31 | 31 | 2.169437e+06 | 3.7209152e+06 | +71.52%
> 32 | 32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
> 33 | 33 | 2.27133e+06 | 3.435459e+06 | +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.
That's nice.
> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
> 16383 | 16383 | 21.629467 | 73.58552 | +240.21%
>
> clang 15.0.7:
>
> 16383 | 16383 | 11.562384 | 73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.
Very nice.
I've now also had an initial look at the actual code of the patches:
* v3-0001
Looks pretty straight forward, nice with the PRODSUM macros,
that really improved readability a lot.
I like these simplifications, how `var2ndigits` is used instead of `res_ndigits`:
- for (int i = res_ndigits - 3; i >= 1; i--)
+ for (int i = var2ndigits - 1; i >= 1; i--)
But I wonder why does `case 1:` not follow the same pattern?
for (int i = res_ndigits - 2; i >= 0; i--)
* v3-0002
I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweighs
the benefits of simpler code.
You brought up the question if 32-bit systems should be regarded
as legacy previously in this thread.
Unfortunately, we didn't get any feedback, so I'm starting a separate
thread, with subject "Is fast 32-bit code still important?", hoping to get
more input to help us make judgement calls.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Joel Jacobson | 2024-07-29 20:40:16 | Is *fast* 32-bit support still important? |
Previous Message | Tom Lane | 2024-07-29 20:18:32 | Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup) |