From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
---|---|
To: | "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com> |
Subject: | Re: Popcount optimization using AVX512 |
Date: | 2023-11-03 11:16:05 |
Message-ID: | CAEze2WjaFLhp7=Eo-mbSaHsoeq7ZEs00yZ2+FpSpijH+KN_hbA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul(dot)d(dot)amonson(at)intel(dot)com> wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.
> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
> a. Software only and
> b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
> a. _mm512_popcnt_epi64()
> b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2023-11-03 11:31:43 | Re: Synchronizing slots from primary to standby |
Previous Message | Xiang Gao | 2023-11-03 10:46:57 | RE: CRC32C Parallel Computation Optimization on ARM |