Re: Popcount optimization using AVX512

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2024-07-30 21:32:07
Message-ID: Zqlb1_-BWlzVMbZv@nathan
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote:
> I've noticed that the configure probes for this are quite slow - pretty much
> the slowest step in a meson setup (and autoconf is similar). While looking
> into this, I also noticed that afaict the tests don't do the right thing for
> msvc.
>
> ...
> [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES
> [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES
> [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES
> [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO
> [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES
> [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO
> [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES
> [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO
> [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES
> [9.367] Checking if "x86_64: popcntq instruction" compiles: YES
> [9.382] Has header "atomic.h" : NO
> ...
>
> (the times here are a bit exaggerated, enabling them in meson also turns on
> python profiling, which makes everything a bit slower)
>
>
> Looks like this is largely the fault of including immintrin.h:
>
> echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null
> 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k
>
> echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null
> 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k

Interesting. Thanks for bringing this to my attention.

> Do we really need to link the generated programs? If we instead were able to
> just rely on the preprocessor, it'd be vastly faster.
>
> The __sync* and __atomic* checks actually need to link, as the compiler ends
> up generating calls to unimplemented functions if the compilation target
> doesn't support some operation natively - but I don't think that's true for
> the xsave/avx512 stuff
>
> Afaict we could just check for predefined preprocessor macros:
>
> echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
> #define __AVX512BW__ 1
> #define __AVX512VPOPCNTDQ__ 1
> #define __XSAVE__ 1
> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k
>
> echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k

Seems promising. I can't think of a reason that wouldn't work.

> Now, a reasonable counter-argument would be that only some of these macros are
> defined for msvc ([1]). However, as it turns out, the test is broken
> today, as msvc doesn't error out when using an intrinsic that's not
> "available" by the target architecture, it seems to assume that the caller did
> a cpuid check ahead of time.
>
>
> Check out [2], it shows the various predefined macros for gcc, clang and msvc.
>
>
> ISTM that the msvc checks for xsave/avx512 being broken should be an open
> item?

I'm not following this one. At the moment, we always do a runtime check
for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and
set the function pointers appropriately, right? We could, of course, still
fix it, though.

--
nathan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-07-30 21:35:27 can we mark upper/lower/textlike functions leakproof?
Previous Message Andres Freund 2024-07-30 21:07:01 Re: Popcount optimization using AVX512