Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
Date: 2024-12-09 11:39:36
Message-ID: CANWCAZbr4sO1bPoS+E=iRWnrBZp7zUKZEJk39KYt_Pu9+X1-SQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer
<raghuveer(dot)devulapalli(at)intel(dot)com> wrote:

> > There is another technique that computes CRC on 3 separate chunks and
> > combines them at the end, so about 3x faster on large-enough chunks.
> > That's the way used for the Arm proposal [0], coincidentally also citing a white
> > paper from Intel, but as Dimitry pointed out in that thread, its link has apparently
> > disappeared. Raghuveer, do you know about this, and is there another link
> > available?
> >
> > http://www.intel.com/content/dam/www/public/us/en/documents/white-
> > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
>
> I am not aware of this paper. Let me poke a few people internally and get back to you on this.

Thanks! I have a portable PoC of how this works, but I'll save that
for another thread, since it's not Intel (or Arm) specific.

> > The cut off point in one implementation is only 144 bytes [1] , which is maybe not
> > as small as we'd like, but is quite a bit smaller than 256. That seems better suited
> > to our workloads, and more portable. I have a *brand-new* laptop with an Intel
> > chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture.
> > I also understand that Sierra Forrest (a server product line) will be all little cores
> > with no AVX-512 support, so I'm not sure why the proposal here requires AVX-
> > 512.
>
> AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. Sierra Forest contains the SSE and AVX/AVX2 family ISA but AFAIK AVX/AVX2 does not contain any CRC32C specific instructions. See:

CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics
used in the patch are not CRC-specific, if I understand correctly.

My point was, it seems Intel still considers AVX-512 as optional, so
we can't count on it being present even in future chips. That's why
I'm interested in alternatives, at least as a first step. If we can
get 3x throughput, the calculation might bend up low enough in the
profile that going to 6x might not be noticeable (not sure).

> > > There a very frequent call computing COMP_CRC32C over just 20 bytes,
> > > while holding a crucial lock. If we were to do introduce something
> > > like this
> > > AVX-512 algorithm, it'd probably be worth to dispatch differently in
> > > case of compile-time known small lengths.
> >
> > I know you've read an earlier version of the patch and realized that it wouldn't
> > help here, but we could probably dispatch differently regardless, although it may
> > only be worth it if we can inline the instructions. Since we technically only need to
> > wait for xl_prev, I believe we could push the computation of the other 12 bytes to
> > before acquiring the lock, then only execute a single instruction on xl_prev to
> > complete the CRC computation. Is there any reason why we couldn't do that,
> > assuming we have a clean way to make that portable? That would mean that the
> > CRCs between major versions would be different, but I think we don't guarantee
> > that anyway.
>
> Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am on travel with limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is this a blocker for the patch or is this something we can address as a revision?

This is orthogonal and is not related to the patch, since it doesn't
affect 8 and 20-byte paths, only 256 and greater.

--
John Naylor
Amazon Web Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2024-12-09 11:48:30 Re: Track the amount of time waiting due to cost_delay
Previous Message Shubham Khanna 2024-12-09 10:54:43 Adding a '--two-phase' option to 'pg_createsubscriber' utility.