From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | Ants Aasma <ants(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: add AVX2 support to simd.h |
Date: | 2024-03-21 04:30:30 |
Message-ID: | CANWCAZYfUv3iN2Vx--tmQP9WU6xwfU8d=gA7LoFXOYP9-wo8Hg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Mar 21, 2024 at 2:55 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> > I don't mind removing the 2-register stuff if that's what you think we
> > should do. I'm cautiously optimistic that it'd help more than the extra
> > branch prediction might hurt, and it'd at least help avoid regressing the
> > lower end for the larger AVX2 registers, but I probably won't be able to
> > prove that without constructing another benchmark. And TBH I'm not sure
> > it'll significantly impact any real-world workload, anyway.
>
> Here's a new version of the patch set with the 2-register stuff removed,
I'm much happier about v5-0001. With a small tweak it would match what
I had in mind:
+ if (nelem < nelem_per_iteration)
+ goto one_by_one;
If this were "<=", then for long arrays we could assume there is
always more than one block, and wouldn't need to check if any elements
remain -- first block, then a single loop and it's done.
The loop could also then be a "do while" since it doesn't have to
check the exit condition up front.
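
To illustrate the control flow described above, here is a minimal sketch in plain C. It is not the patch itself: BLOCK, block_contains() and lfind32_sketch() are hypothetical names, and the block check is a scalar stand-in for the SIMD comparison. It also assumes the loop's starting offset is chosen so that the blocks end exactly at the end of the array, re-checking a few elements of the first block.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of elements handled per iteration. */
#define BLOCK 16

/* Scalar stand-in for the SIMD comparison of one full block. */
static bool
block_contains(uint32_t key, const uint32_t *p)
{
    bool        found = false;

    for (size_t i = 0; i < BLOCK; i++)
        found |= (p[i] == key);
    return found;
}

static bool
lfind32_sketch(uint32_t key, const uint32_t *base, size_t nelem)
{
    size_t      i;

    /* With "<=", everything past this point has more than one block. */
    if (nelem <= BLOCK)
    {
        for (i = 0; i < nelem; i++)
        {
            if (base[i] == key)
                return true;
        }
        return false;
    }

    /* First block. */
    if (block_contains(key, base))
        return true;

    /*
     * Assumption for this sketch: start the loop at an offset in
     * [1, BLOCK] such that the remaining length is a whole number of
     * blocks.  Some elements of the first block may be checked twice.
     */
    i = ((nelem - 1) % BLOCK) + 1;

    /* nelem > BLOCK, so at least one iteration is always needed. */
    do
    {
        if (block_contains(key, base + i))
            return true;
        i += BLOCK;
    } while (i < nelem);

    return false;
}
```

Since nelem > BLOCK is guaranteed after the early exit, the loop body always runs at least once and nothing needs to be tested before entering it.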
> plus a fresh run of the benchmark. The weird spike for AVX2 is what led me
> down the 2-register path earlier.
Yes, that spike is weird, because it seems super-linear. However, the
more interesting question for me is: AVX2 isn't really buying much for
the numbers covered in this test. Between 32 and 48 elements, and
between 64 and 80, it's indistinguishable from SSE2. The jumps to the
next shelf are postponed, but the jumps are just as high. From earlier
system benchmarks, I recall it eventually wins out with hundreds of
elements, right? Is that still true?
Further, now that the algorithm is more SIMD-appropriate, I wonder
what doing 4 registers at a time is actually buying us for either SSE2
or AVX2. It might just be a matter of scale, but that would be good to
understand.