Re: Popcount optimization using AVX512

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2024-07-30 21:07:01
Message-ID: 20240730210701.oyn2wr4zduu2efrg@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-04-23 11:02:07 -0500, Nathan Bossart wrote:
> On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote:
> > Makes sense, thanks. I'm planning to commit this fix sometime early next
> > week.
>
> Committed.

I've noticed that the configure probes for this are quite slow - pretty much
the slowest step in a meson setup (and autoconf is similar). While looking
into this, I also noticed that afaict the tests don't do the right thing for
msvc.

...
[6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES
[6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES
[6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES
[7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO
[8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES
[8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO
[9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES
[9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO
[9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES
[9.367] Checking if "x86_64: popcntq instruction" compiles: YES
[9.382] Has header "atomic.h" : NO
...

(The times here are a bit exaggerated: enabling them in meson also turns on
Python profiling, which makes everything a bit slower.)

Looks like this is largely the fault of including immintrin.h:

echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null
0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k

echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null
0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k

Do we really need to link the generated programs? If we instead were able to
just rely on the preprocessor, it'd be vastly faster.

The __sync* and __atomic* checks actually need to link, as the compiler ends
up generating calls to unimplemented functions if the compilation target
doesn't support some operation natively. But I don't think that's true for
the xsave/avx512 checks.

Afaict we could just check for predefined preprocessor macros:

echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
#define __AVX512BW__ 1
#define __AVX512VPOPCNTDQ__ 1
#define __XSAVE__ 1
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k

echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k

Now, a reasonable counter-argument would be that only some of these macros are
defined for msvc ([1]). However, as it turns out, the test is already broken
today: msvc doesn't error out when using an intrinsic that's not "available"
on the target architecture. It seems to assume that the caller did a cpuid
check ahead of time.

Check out [2], it shows the various predefined macros for gcc, clang and msvc.

ISTM that the msvc checks for xsave/avx512 being broken should be an open
item?

Greetings,

Andres

[1] https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170
[2] https://godbolt.org/z/c8Kj8r3PK
