From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com> |
Subject: | Re: [PATCH] SVE popcount support |
Date: | 2025-03-24 11:34:45 |
Message-ID: | CANWCAZbE6Q7MiKP1iLEcWFoJm3cPEFns1+C76yF4LxzqwQiC4w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:
> * 0002 introduces the Neon implementation, which conveniently doesn't need
> configure-time checks or function pointers. I noticed that some
> compilers (e.g., Apple clang 16) compile in Neon instructions already,
> but our hand-rolled implementation is better about instruction-level
> parallelism and seems to still be quite a bit faster.
+pg_popcount64(uint64 word)
+{
+ return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}
This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?
Some thoughts to consider, some speculative and maybe not worth
putting time into:
> I did add a 2-register block
> in the Neon implementation for processing the tail because I was worried
> about its performance on smaller buffers, but that part might get removed
> if I can't measure any difference.
Even if we can measure a difference on fixed-sized inputs, that might
not carry over when the branch is unpredictable.
+ /*
+ * Process remaining 8-byte blocks.
+ */
+ for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+ {
+ popcnt += pg_popcount64(*((uint64 *) buf));
+ buf += sizeof(uint64);
+ }
This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.
> * 0003 introduces the SVE implementation. You'll notice I've moved all the
> function pointer gymnastics into the pg_popcount_aarch64.c file, which is
> where the Neon implementations live, too. I also tried to clean up the
> configure checks a bit. I imagine it's possible to make them more
> compact, but I felt that the enhanced readability was worth it.
I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as
well.
> * For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.
I wonder if alignment matters for these larger blocks.
--
John Naylor
Amazon Web Services
From | Date | Subject | |
---|---|---|---|
Next Message | John Naylor | 2025-03-24 11:37:42 | Re: Improve CRC32C performance on SSE4.2 |
Previous Message | jian he | 2025-03-24 11:14:27 | Re: support fast default for domain with constraints |