Re: [PATCH] SVE popcount support

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
Subject: Re: [PATCH] SVE popcount support
Date: 2025-03-24 11:34:45
Message-ID: CANWCAZbE6Q7MiKP1iLEcWFoJm3cPEFns1+C76yF4LxzqwQiC4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:

> * 0002 introduces the Neon implementation, which conveniently doesn't need
> configure-time checks or function pointers. I noticed that some
> compilers (e.g., Apple clang 16) compile in Neon instructions already,
> but our hand-rolled implementation is better about instruction-level
> parallelism and seems to still be quite a bit faster.

+pg_popcount64(uint64 word)
+{
+ return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}

This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?

Some thoughts to consider, some speculative and maybe not worth
putting time into:

> I did add a 2-register block
> in the Neon implementation for processing the tail because I was worried
> about its performance on smaller buffers, but that part might get removed
> if I can't measure any difference.

Even if we can measure a difference on fixed-sized inputs, that might
not carry over when the branch is unpredictable.

+ /*
+ * Process remaining 8-byte blocks.
+ */
+ for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+ {
+ popcnt += pg_popcount64(*((uint64 *) buf));
+ buf += sizeof(uint64);
+ }

This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.

> * 0003 introduces the SVE implementation. You'll notice I've moved all the
> function pointer gymnastics into the pg_popcount_aarch64.c file, which is
> where the Neon implementations live, too. I also tried to clean up the
> configure checks a bit. I imagine it's possible to make them more
> compact, but I felt that the enhanced readability was worth it.

I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as
well.

> * For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.

I wonder if alignment matters for these larger blocks.

--
John Naylor
Amazon Web Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2025-03-24 11:37:42 Re: Improve CRC32C performance on SSE4.2
Previous Message jian he 2025-03-24 11:14:27 Re: support fast default for domain with constraints