From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com> |
Subject: | Re: [PATCH] SVE popcount support |
Date: | 2025-03-24 19:06:13 |
Message-ID: | Z-GtJcGVxDcYzsgK@nathan |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Mar 24, 2025 at 06:34:45PM +0700, John Naylor wrote:
> On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
> <nathandbossart(at)gmail(dot)com> wrote:
>> * 0002 introduces the Neon implementation, which conveniently doesn't need
>> configure-time checks or function pointers. I noticed that some
>> compilers (e.g., Apple clang 16) compile in Neon instructions already,
>> but our hand-rolled implementation is better about instruction-level
>> parallelism and seems to still be quite a bit faster.
>
> +pg_popcount64(uint64 word)
> +{
> + return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
> +}
>
> This confused me until I found that this is what
> __builtin_popcountl(word) would emit anyway. Worth a comment?
Sure thing.
> + /*
> + * Process remaining 8-byte blocks.
> + */
> + for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
> + {
> + popcnt += pg_popcount64(*((uint64 *) buf));
> + buf += sizeof(uint64);
> + }
>
> This uses 16-byte registers, but only loads 8-bytes at a time (with
> accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
> you could instead do a loop over a single local accumulator, which I
> think could have a short accumulation pipeline since 3 iterations
> can't overflow 8-bit lanes. But then the bytewise tail could be up to
> 15 bytes.
Yeah, I wasn't sure how far we wanted to go with this. We could do 4
registers at a time, then 2, then 1, then 8-bytes, then byte-by-byte, but
that's quite a few extra lines of code for the amount of gain, not to
mention the extra overhead. My inclination was to try to keep this as
simple as possible while making sure we didn't regress on small inputs.
>> * 0003 introduces the SVE implementation. You'll notice I've moved all the
>> function pointer gymnastics into the pg_popcount_aarch64.c file, which is
>> where the Neon implementations live, too. I also tried to clean up the
>> configure checks a bit. I imagine it's possible to make them more
>> compact, but I felt that the enhanced readability was worth it.
>
> I don't know what the configure checks looked like before, but I'm
> confused that the loops are unrolled in the link-test functions as
> well.
We do need the two separate blocks because they use different intrinsic
functions, but I could probably remove the actual "for" loops themselves to
simplify things a bit.
>> * For both Neon and SVE, I do see improvements with looping over 4
>> registers at a time, so IMHO it's worth doing so even if it performs the
>> same as 2-register blocks on some hardware.
>
> I wonder if alignment matters for these larger blocks.
Some earlier benchmarks didn't show anything outside of the noise range
[0].
[0] https://postgr.es/m/OSBPR01MB266403FD4C05DAB58EBBA82897EF2%40OSBPR01MB2664.jpnprd01.prod.outlook.com
--
nathan
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2025-03-24 19:11:19 | Re: vacuum_truncate configuration parameter and isset_offset |
Previous Message | Lukas Fittl | 2025-03-24 18:47:40 | Re: Proposal - Allow extensions to set a Plan Identifier |