From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Popcount optimization using AVX512 |
Date: | 2024-04-05 15:38:11 |
Message-ID: | 20240405153811.GA9352@nathanxps13 |
Lists: | pgsql-hackers |
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote:
> On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
>> The main issue I saw was that clang was able to peel off the first
>> iteration of the loop and then eliminate the mask assignment and
>> replace masked load with a memory operand for vpopcnt. I was not able
>> to convince gcc to do that regardless of optimization options.
>> Generated code for the inner loop:
>>
>> clang:
>> <L2>:
>> 50: add rdx, 64
>> 54: cmp rdx, rdi
>> 57: jae <L1>
>> 59: vpopcntq zmm1, zmmword ptr [rdx]
>> 5f: vpaddq zmm0, zmm1, zmm0
>> 65: jmp <L2>
>>
>> gcc:
>> <L1>:
>> 38: kmovq k1, rdx
>> 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
>> 43: add rax, 64
>> 47: mov rdx, -1
>> 4e: vpopcntq zmm0, zmm0
>> 54: vpaddq zmm0, zmm0, zmm1
>> 5a: vmovdqa64 zmm1, zmm0
>> 60: cmp rax, rsi
>> 63: jb <L1>
>>
>> I'm not sure how much that matters in practice. Attached is a patch to
>> do this manually giving essentially the same result in gcc. As most
>> distro packages are built using gcc I think it would make sense to
>> have the extra code if it gives a noticeable benefit for large cases.
>
> Yeah, I did see this, but I also wasn't sure if it was worth further
> complicating the code. I can test with and without your fix and see if it
> makes any difference in the benchmarks.

This seems to provide a small performance boost, so I've incorporated it
into v27.
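
As a rough illustration of the loop peeling discussed above, here is a portable scalar model in C. It is a sketch, not the code from v27: the names `popcount_peeled`, `masked_chunk_popcount`, `full_chunk_popcount` and `byte_popcount` are hypothetical, the 64-bit masks stand in for AVX-512 `k` registers, and the buffer is assumed to start on a chunk boundary so that no byte outside the array is ever touched. The point is only the shape: the mask is applied in the peeled first (and last) iteration, and the steady-state loop reads whole 64-byte chunks with no mask setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 64				/* bytes per 512-bit vector */

/* Bits set in one byte; portable stand-in for the vector popcount. */
static unsigned
byte_popcount(uint8_t b)
{
	unsigned	n = 0;

	for (; b != 0; b &= (uint8_t) (b - 1))
		n++;
	return n;
}

/* Models a zero-masked load plus popcount: byte i is read only if bit i is set. */
static uint64_t
masked_chunk_popcount(const uint8_t *chunk, uint64_t mask)
{
	uint64_t	n = 0;

	for (int i = 0; i < CHUNK; i++)
		if ((mask >> i) & 1)
			n += byte_popcount(chunk[i]);
	return n;
}

/* Models a popcount with a plain memory operand: all 64 bytes are read. */
static uint64_t
full_chunk_popcount(const uint8_t *chunk)
{
	uint64_t	n = 0;

	for (int i = 0; i < CHUNK; i++)
		n += byte_popcount(chunk[i]);
	return n;
}

/* Count set bits in base[start..end); base is assumed to be chunk-aligned. */
uint64_t
popcount_peeled(const uint8_t *base, size_t start, size_t end)
{
	assert(base != NULL);
	if (start >= end)
		return 0;

	size_t		first = start / CHUNK;		/* chunk holding the first byte */
	size_t		last = (end - 1) / CHUNK;	/* chunk holding the last byte */
	uint64_t	head_mask = ~UINT64_C(0) << (start % CHUNK);
	uint64_t	tail_mask = ~UINT64_C(0) >> (CHUNK - 1 - (end - 1) % CHUNK);

	/* Everything fits in one chunk: combine both masks. */
	if (first == last)
		return masked_chunk_popcount(base + first * CHUNK,
									 head_mask & tail_mask);

	/* Peeled first iteration: the only place the head mask is needed. */
	uint64_t	total = masked_chunk_popcount(base + first * CHUNK, head_mask);

	/* Steady state: full chunks, no mask assignment inside the loop. */
	for (size_t c = first + 1; c < last; c++)
		total += full_chunk_popcount(base + c * CHUNK);

	/* Final, possibly partial chunk. */
	total += masked_chunk_popcount(base + last * CHUNK, tail_mask);
	return total;
}
```

Calling `popcount_peeled(buf, start, end)` on a byte array gives the same result as a naive per-byte loop over `buf[start..end)`. In a real AVX-512 version the two helpers would presumably become a masked load feeding a vector popcount and an unmasked vector popcount, respectively.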
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v27-0001-AVX512-popcount-support.patch | text/x-diff | 29.5 KB |
v27-0002-optimize-visibilitymap_count-with-AVX512.patch | text/x-diff | 11.9 KB |