From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Popcount optimization using AVX512 |
Date: | 2024-03-07 21:36:00 |
Message-ID: | 20240307213600.GA563369@nathanxps13 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
As promised...
> +# Check for Intel AVX512 intrinsics to do POPCNT calculations.
> +#
> +PGAC_AVX512_POPCNT_INTRINSICS([])
> +if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
> + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
> +fi
> +AC_SUBST(CFLAGS_AVX512_POPCNT)
I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
machine, -mavx512vpopcntdq alone is enough to pass this test, so if there
are other instructions required that need -mavx512f, then we might need to
expand the test.
> 13 files changed, 657 insertions(+), 119 deletions(-)
I still think it's worth breaking this change into at least 2 patches. In
particular, I think there's an opportunity to do the refactoring into
pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512
stuff. These changes are likely straightforward, and getting them out of
the way early would make it easier to focus on the more interesting
changes. IMHO there are a lot of moving parts in this patch.
> +#undef HAVE__GET_CPUID_COUNT
> +
> +/* Define to 1 if you have immintrin. */
> +#undef HAVE__IMMINTRIN
Is this missing HAVE__CPUIDEX?
> uint64
> -pg_popcount(const char *buf, int bytes)
> +pg_popcount_slow(const char *buf, int bytes)
> {
> uint64 popcnt = 0;
>
> -#if SIZEOF_VOID_P >= 8
> +#if SIZEOF_VOID_P == 8
> /* Process in 64-bit chunks if the buffer is aligned. */
> if (buf == (const char *) TYPEALIGN(8, buf))
> {
> @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
>
> buf = (const char *) words;
> }
> -#else
> +#elif SIZEOF_VOID_P == 4
> /* Process in 32-bit chunks if the buffer is aligned. */
> if (buf == (const char *) TYPEALIGN(4, buf))
> {
Apologies for harping on this, but I'm still not seeing the need for these
SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
difference, I see no reason to more strictly check SIZEOF_VOID_P here.
> + /* Process any remaining bytes */
> + while (bytes--)
> + popcnt += pg_number_of_ones[(unsigned char) *buf++];
> + return popcnt;
> +#else
> + return pg_popcount_slow(buf, bytes);
> +#endif /* USE_AVX512_CODE */
nitpick: Could we call pg_popcount_slow() in a common section for these
"remaining bytes?"
> +#if defined(_MSC_VER)
> + pg_popcount_indirect = pg_popcount512_fast;
> +#else
> + pg_popcount = pg_popcount512_fast;
> +#endif
These _MSC_VER sections are interesting. I'm assuming this is the
workaround for the MSVC linking issue you mentioned above. I haven't
looked too closely, but I wonder if the CRC32C code (see
src/include/port/pg_crc32c.h) is doing something different to avoid this
issue.
Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
through this thread and didn't see any recent benchmark results for the
latest form of the patch. I think it's worth verifying that we are still
seeing the expected improvements.
[0] https://postgr.es/m/202402071953.5c4z7t6kl7ts%40alvherre.pgsql
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Nathan Bossart | 2024-03-07 22:00:47 | Re: postgres_fdw test timeouts |
Previous Message | Peter Geoghegan | 2024-03-07 21:07:28 | Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan |