Re: Popcount optimization using AVX512

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2023-11-15 21:48:53
Message-ID: 20231115214853.GC2448045@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:
> AVX512 has light and heavy instructions. While the heavy AVX512
> instructions have clock frequency implications, the light instructions
> not so much. See [0] for more details. We captured EMON data for the
> benchmark used in this work, and see that the instructions are using the
> licensing level not meant for heavy AVX512 operations. This means the
> instructions for popcount : _mm512_popcnt_epi64(),
> _mm512_reduce_add_epi64() are not going to have any significant impact on
> CPU clock frequency.
>
> Clock frequency impact aside, we measured the same benchmark for gains on
> older Intel hardware and observe up to 18% better performance on Intel
> Icelake. On older intel hardware, the popcntdq 512 instruction is not
> present so it won’t work. If clock frequency is not affected, rest of
> workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

> Testing this on smaller block sizes < 8KiB shows that AVX512 compared to
> the current 64bit behavior shows slightly lower performance, but with a
> large variance. We cannot conclude much from it. The testing with ANALYZE
> benchmark by Nathan also points to no visible impact as a result of using
> AVX512. The gains on larger dataset is easily evident, with less
> variance.
>
> What are your thoughts if we introduce AVX512 popcount for smaller sizes
> as an optional feature initially, and then test it more thoroughly over
> time on this particular use case?

I don't see any need to rush this. At the very earliest, this feature
would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like
to do. However, if you are seeing worse performance with this patch, then
it seems unlikely that we'd want to proceed.

> Thoughts or feedback on the approach in the patch? This solution should
> not impact anyone who doesn’t use the feature i.e. AVX512. Open to
> additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's
also a decent chance that this code goes virtually untested. As I've
stated elsewhere [0], I think we should ensure there's buildfarm coverage
for this kind of architecture-specific stuff.

[0] https://postgr.es/m/20230726043707.GB3211130%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-11-15 22:18:29 Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)
Previous Message Robert Haas 2023-11-15 21:32:48 Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM