From: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Popcount optimization using AVX512 |
Date: | 2024-04-05 07:33:27 |
Message-ID: | CANwKhkMQtZCxa+nq=9QAoT6rgSQ48cVpH83tO3Md+-ck4bVz2w@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> Here is an updated patch set. IMHO this is in decent shape and is
> approaching committable.
I checked the code generation on various gcc and clang versions. It
looks mostly fine starting from the first versions that support
AVX-512, gcc-7.1 and clang-5.
The main issue I saw was that clang was able to peel off the first
iteration of the loop, eliminate the mask assignment inside the loop,
and replace the masked load with a memory operand for vpopcntq. I was
not able to convince gcc to do that regardless of optimization options.
Generated code for the inner loop:

clang:
<L2>:
    50: add rdx, 64
    54: cmp rdx, rdi
    57: jae <L1>
    59: vpopcntq zmm1, zmmword ptr [rdx]
    5f: vpaddq zmm0, zmm1, zmm0
    65: jmp <L2>

gcc:
<L1>:
    38: kmovq k1, rdx
    3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
    43: add rax, 64
    47: mov rdx, -1
    4e: vpopcntq zmm0, zmm0
    54: vpaddq zmm0, zmm0, zmm1
    5a: vmovdqa64 zmm1, zmm0
    60: cmp rax, rsi
    63: jb <L1>
I'm not sure how much that matters in practice. Attached is a patch
that does the peeling manually, giving essentially the same result in
gcc. As most distro packages are built using gcc, I think it would make
sense to have the extra code if it gives a noticeable benefit for
large inputs.
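To illustrate what "peeling the first iteration" means here, below is a
portable scalar sketch. It is not the attached patch and uses no AVX-512
intrinsics: 8-byte chunks stand in for 64-byte zmm registers, an 8-bit
mask stands in for the 64-bit k-register mask, and all names
(popcount_peeled, masked_load, full_load, CHUNK) are hypothetical. It
assumes gcc or clang for __builtin_popcountll. The point is only the
shape: the head mask is needed once, so the first chunk is handled
outside the loop and the loop body uses plain loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8					/* stand-in for the 64-byte zmm width */

/* Emulates a zero-masking load: byte i is read only if mask bit i is set. */
static uint64_t
masked_load(const unsigned char *base, ptrdiff_t off, unsigned mask)
{
	uint64_t	val = 0;

	for (int i = 0; i < CHUNK; i++)
		if (mask & (1u << i))
			val |= (uint64_t) base[off + i] << (8 * i);
	return val;
}

/* Unmasked load of a whole chunk; every byte must be inside the buffer. */
static uint64_t
full_load(const unsigned char *base, ptrdiff_t off)
{
	uint64_t	val;

	memcpy(&val, base + off, CHUNK);
	return val;
}

/* Byte-at-a-time reference, used only to check the result. */
static uint64_t
popcount_bytewise(const unsigned char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < bytes; i++)
		cnt += (uint64_t) __builtin_popcountll(buf[i]);
	return cnt;
}

static uint64_t
popcount_peeled(const unsigned char *buf, size_t bytes)
{
	uint64_t	accum = 0;
	ptrdiff_t	head, off, end, last;
	unsigned	mask;

	assert(buf != NULL || bytes == 0);
	if (bytes == 0)
		return 0;

	head = (ptrdiff_t) ((uintptr_t) buf % CHUNK);	/* bytes before buf in its chunk */
	off = -head;				/* offset of the first aligned chunk */
	end = (ptrdiff_t) bytes;
	last = off + ((end - 1 - off) / CHUNK) * CHUNK;	/* chunk holding the last byte */
	mask = (0xFFu << head) & 0xFFu;		/* skip the bytes before buf */

	if (off < last)
	{
		/* Peeled first iteration: the only one that needs the head mask. */
		accum += (uint64_t) __builtin_popcountll(masked_load(buf, off, mask));
		off += CHUNK;
		mask = 0xFFu;

		/* Steady state: whole chunks, no mask bookkeeping in the loop. */
		for (; off < last; off += CHUNK)
			accum += (uint64_t) __builtin_popcountll(full_load(buf, off));
	}

	/* Final chunk: AND the live mask with the tail mask. */
	mask &= 0xFFu >> (CHUNK - (end - off));
	accum += (uint64_t) __builtin_popcountll(masked_load(buf, off, mask));
	return accum;
}
```

Calling popcount_peeled(buf, len) should agree with popcount_bytewise
for any start alignment and length. Whether the same restructuring pays
off with real AVX-512 loads would still need to be measured.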
The visibility map patch has the same issue, otherwise looks good.
Regards,
Ants Aasma
Attachment | Content-Type | Size |
---|---|---|
avx512-peel-first-iteration.patch | text/x-patch | 915 bytes |