From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2024-04-01 21:11:59
Message-ID: CANwKhkM-YZGE527y00LPU1660GW6zicvX7a=ZOsxnYYU0hng3g@mail.gmail.com
Lists: pgsql-hackers
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> + popcnt = _mm512_reduce_add_epi64(accum);
> >> + return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here? Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that. (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does. I did another benchmark on very small arrays and can see the
> overhead. This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
> size (bytes) HEAD AVX512-POPCNT
> 1 1707.685 3480.424
> 2 1926.694 4606.182
> 4 3210.412 5284.506
> 8 1920.703 3640.968
> 16 2936.91 4045.586
> 32 3627.956 5538.418
> 64 5347.213 3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked-out portions of a load instruction
will not generate an exception. To allow byte-level granularity
masking, -mavx512bw is needed. Based on Wikipedia, requiring BW only
disables this fast path on Knights Mill (Xeon Phi); in all other cases
VPOPCNTQ implies availability of BW.
Attached is an example of what I mean. I did not have a machine to
test it on, but the generated code looks sane. I added the clang
pragma because it insisted on unrolling otherwise, and based on how
the instruction dependencies look, unrolling is probably not too
helpful even for large inputs (needs to be tested). The configure
check and compile flags would of course need to be amended for BW.
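For readers without the attachment handy, here is a sketch of the masked-tail idea (not the attached patch itself; the function and variable names are illustrative). Full 64-byte chunks go through VPOPCNTQ as before; the remaining 0..63 bytes are loaded with an AVX-512BW byte-granularity masked load, whose masked-out lanes cannot fault, so the same vector code path handles the tail with no scalar fallback call:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch: count set bits in buf[0..bytes) using AVX-512.
 * Requires AVX512VPOPCNTDQ (for _mm512_popcnt_epi64), AVX512BW (for the
 * byte-granularity masked load), and BMI2 (for _bzhi_u64).
 */
__attribute__((target("avx512vpopcntdq,avx512bw,bmi2")))
static uint64_t
popcount_masked_tail(const char *buf, size_t bytes)
{
    __m512i     accum = _mm512_setzero_si512();

    /* Full 64-byte chunks: vector popcount of eight 64-bit lanes. */
    for (; bytes >= 64; buf += 64, bytes -= 64)
    {
        __m512i     val = _mm512_loadu_si512((const void *) buf);

        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
    }

    /* Tail: masked load of the last 0..63 bytes, same popcount path.
     * Masked-out byte lanes are zeroed and do not generate faults. */
    if (bytes > 0)
    {
        __mmask64   mask = (__mmask64) _bzhi_u64(~UINT64_C(0), bytes);
        __m512i     val = _mm512_maskz_loadu_epi8(mask, (const void *) buf);

        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
    }

    return (uint64_t) _mm512_reduce_add_epi64(accum);
}
```

The mask is just the low `bytes` bits set, so the tail costs one extra masked load and add rather than a call into the scalar/64-bit path.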
Regards,
Ants Aasma
Attachment: avx512-popcnt-masked-tail.patch (text/x-patch, 1.2 KB)