Re: Popcount optimization using AVX512

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Popcount optimization using AVX512
Date: 2023-11-03 11:16:05
Message-ID: CAEze2WjaFLhp7=Eo-mbSaHsoeq7ZEs00yZ2+FpSpijH+KN_hbA@mail.gmail.com
Lists: pgsql-hackers

On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul(dot)d(dot)amonson(at)intel(dot)com> wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.

How does this compare on older CPUs, and in more mixed workloads? IIRC,
executing AVX512 instructions (which I believe this one to be among)
has significant implications for core clock frequency while those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
> a. Software only and
> b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
> a. _mm512_popcnt_epi64()
> b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
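
For reference, a minimal sketch of what a loop built on the two intrinsics
listed above could look like. This is not the benchmark code from the
proposal; the function names popcount_scalar and popcount_avx512_sketch are
hypothetical, and the AVX512 path is only compiled when the compiler is told
the target supports it (for example via -mavx512f -mavx512vpopcntdq), with a
scalar fallback otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

/* Portable reference: count set bits 8 bytes at a time, then the tail. */
static uint64_t
popcount_scalar(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    assert(buf != NULL || len == 0);
    while (len >= sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* avoids unaligned access */
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }
    while (len-- > 0)
        count += (uint64_t) __builtin_popcount(*buf++);
    return count;
}

/*
 * Hypothetical AVX512 variant: 64 bytes per iteration, per-lane counts
 * accumulated in a vector, one horizontal reduction at the end, and the
 * remaining (<64) bytes handled by the scalar routine.
 */
static uint64_t
popcount_avx512_sketch(const unsigned char *buf, size_t len)
{
#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    __m512i acc = _mm512_setzero_si512();

    assert(buf != NULL || len == 0);
    while (len >= 64)
    {
        __m512i chunk = _mm512_loadu_si512((const void *) buf);

        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(chunk));
        buf += 64;
        len -= 64;
    }
    return (uint64_t) _mm512_reduce_add_epi64(acc) + popcount_scalar(buf, len);
#else
    /* Assumption: without compile-time AVX512 support, fall back. */
    return popcount_scalar(buf, len);
#endif
}
```

The compile-time guard is a simplification; a real integration would
presumably need a runtime CPU check before taking the AVX512 path.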

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not used by postgres' own internals, but which might need
to handle bytestreams of >BLCKSZ), the only popcount usages I could
find were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
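
To make the page-sized case concrete, here is a rough sketch of how one
might time repeated calls over a single 8 KiB buffer. Everything in it is
an assumption for illustration: PAGE_BYTES, popcount_words and
seconds_per_call are hypothetical names, and the scalar popcount stands in
for whichever implementation is under test. Note that clock() only measures
CPU time for this loop, so it would not show any frequency effect on
surrounding code in a mixed workload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 8192     /* assumed page size (default BLCKSZ) */

/* Stand-in for the implementation being measured. */
static uint64_t
popcount_words(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    while (len >= sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }
    while (len-- > 0)
        count += (uint64_t) __builtin_popcount(*buf++);
    return count;
}

/*
 * Average CPU seconds per call over 'iters' calls on the same buffer.
 * The sum of all results is stored in *checksum so the work cannot be
 * discarded; the empty asm statement is a compiler barrier (GCC/Clang)
 * that keeps the call from being hoisted out of the loop.
 */
static double
seconds_per_call(const unsigned char *buf, size_t len,
                 unsigned iters, uint64_t *checksum)
{
    uint64_t sum = 0;
    clock_t  start;
    clock_t  end;

    assert(iters > 0);
    start = clock();
    for (unsigned i = 0; i < iters; i++)
    {
        __asm__ volatile("" : : "r"(buf) : "memory");
        sum += popcount_words(buf, len);
    }
    end = clock();

    *checksum = sum;
    return (double) (end - start) / CLOCKS_PER_SEC / (double) iters;
}
```

Calling seconds_per_call(page, PAGE_BYTES, iters, &checksum) with a large
iteration count for each candidate implementation would give a first
per-call comparison at this size.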

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
