From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
---|---|
To: | "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com> |
Cc: | "Sterrett, Matthew" <matthewsterrett2(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com> |
Subject: | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. |
Date: | 2025-01-29 06:46:11 |
Message-ID: | CANWCAZbnvZjFEYzMfsZ5KFCG1oMvod7dZiY+P_8bwmOMPkHChQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Jan 25, 2025 at 3:35 AM Devulapalli, Raghuveer
<raghuveer(dot)devulapalli(at)intel(dot)com> wrote:
> > #1 - The choice of AVX-512. There is no such thing as a "CRC instruction operating
> > on 8 bytes", and the proposed algorithm is a multistep process using carryless
> > multiplication and requiring at least 256 bytes of input. The Chromium sources
> > cited as the source for this patch also contain an implementation using 128-bit
> > instructions, and which only requires at least 64 bytes of input. Is there a reason
> > that not tested or proposed as well? That would be much easier to read/maintain,
> > work on more systems, and might give a speed boost on smaller inputs. These are
> > useful properties to have.
> >
> > https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd
> > .c#L215
>
> Agreed. postgres already has the SSE42 version pg_comp_crc32c_sse42, but I didn’t
> realize it uses the crc32 instruction which processes only 8 bytes at a time. This can
> certainly be upgraded to process 64bytes at a time and should be faster. Since most
> of the AVX-512 stuff is almost ready, I propose to do this in a follow up patch immediately.
It doesn't make sense to me that more limited/difficult hardware
support (and more complex coding for that) and a larger input
threshold should be a prerequisite for something that doesn't have
these disadvantages.
> Let me know if you disagree. The AVX512 version processes 256 bytes at a time and will
> most certainly be faster than the improved SSE42 version, which is why the chromium
> library has both AVX512 and SSE42.
It looks like chromium simply vendored the zlib library. Input
destined for compression is always going to be "large". That's not
true in general for our use case, and we mentioned that fact seven
months ago, when Andres said upthread [1]: "This is extremely workload
dependent, it's not hard to find workloads with lots of very small
record and very few big ones...". Given that feedback, it would have
made a lot of sense to mention the 64-byte alternative back then,
especially since it's the exact same pclmull algorithm based on the
same paper, and is found in the same zlib .c file, but for some reason
that was not done.
More broadly, the best strategy is to start with the customer and work
backward to the technology. It's more risky to pick the technology
upfront and try to find ways to use it. My goal here is to help you
make the right tradeoffs. Here's my view:
1. If we can have a relatively low input size threshold for
improvement, it's possibly worth a bit of additional complexity in
configure and run-time checks. There is a complicating factor in
testing that though: the latency of carryless multiplication
instructions varies drastically on different microarchitectures.
2. If we can improve large inputs in a simple fashion, with no
additional hardware support, that's worth doing in any case.
3. Complex hardware support (6 CPUIDs!) that only works on large
inputs (a minority of workloads) looks to be the worst of both worlds
and it's not the tradeoff we should make.
Further, we verified upthread that Intel's current and near-future
product line includes server chips (some with over 100 cores, so not
exactly low-end) that don't support AVX-512 at all. I have no idea how
common they will be, but they will certainly be found in cloud
datacenters somewhere. Shouldn't we have an answer for them as well?
> > #2 - The legal status of the algorithm from following Intel white paper, which is
> > missing from its original location, archived here:
> >
> > https://web.archive.org/web/20220802143127/https://www.intel.com/content/
> > dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-
> > instruction-paper.pdf
> >
> > https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-
> > asm_64.S
> >
> > ...so I'm unclear if these patents are applicable to software implementations.
> > They also seem to be expired, but I am not a lawyer.
> > Could you look into this please? Even if we do end up with AVX-512, this would be
> > a good fallback.
>
> Given that SSE42 is pretty much available in all x86 processors at this point, do we need a
> fallback C version specially after we improve the SSE42 version.
I know you had extended time off work, but I've already shared my
findings and explained my reasoning [2]. The title of the paper is
"Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction",
so unsurprisingly it does improve the SSE42 version. With a few dozen
lines of code, I can get ~3x speedup on page-sized inputs. At the very
least we want to use this technique on Arm [3], and the only blocker
now is the question regarding the patents. I'm interested to hear the
response on this.
[1] https://www.postgresql.org/message-id/20240612201135.kk77tiqcux77lgev%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/CANWCAZbr4sO1bPoS%2BE%3DiRWnrBZp7zUKZEJk39KYt_Pu9%2BX1-SQ%40mail.gmail.com
[3] https://commitfest.postgresql.org/51/4620/
--
John Naylor
Amazon Web Services
From | Date | Subject | |
---|---|---|---|
Next Message | Jacob Brazeal | 2025-01-29 06:46:35 | Comment cleanup - it's vs its |
Previous Message | Paul Jungwirth | 2025-01-29 06:34:33 | Re: SQL:2011 application time |