Re: Improve CRC32C performance on SSE4.2

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Improve CRC32C performance on SSE4.2
Date: 2025-03-04 05:09:09
Message-ID: CANWCAZbAvNB5cuht8T5wuap6JjwUaqyCezE+U5r4GhWYGYkmWw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:

> I spent some time staring at pg_crc32.h with all these patches applied, and
> IIUC it leads to the following behavior:
>
> * For compiled-in SSE 4.2 builds, we branch based on the length. For
> smaller inputs, we are using an inlined version of the SSE 4.2 code.
> For larger inputs, we call a function pointer so that we can potentially
> use the PCLMUL version.

Right. For WAL, my hope is that the inlined path would balance out the
path with the function pointer, particularly since the computation for
the 20-byte header would be both inlined and unrolled, as I see here
in XLogInsertRecord():

crc32 rax,QWORD PTR [rsi]
crc32 rax,rbx ; <- newly calculated xl_prev
crc32 eax,DWORD PTR [rsi+0x10]
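
For illustration, the length-based dispatch described above could be sketched roughly as follows. This is a minimal sketch and not the patch's code: SMALL_INPUT_MAX, crc32c_small_inline(), crc32c_large and crc32c_update() are hypothetical names, the cutoff value is invented, and the "large" function is only a placeholder where a PCLMUL version might be installed. When built without SSE 4.2, a bit-at-a-time fallback is used so that the sketch still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h>
#define HAVE_CRC32_INSN 1
#endif

/* Hypothetical cutoff; the value used by the actual patches may differ. */
#define SMALL_INPUT_MAX 32

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* One byte of CRC-32C: crc32 instruction if available, else bitwise. */
static inline uint32_t
crc32c_step_byte(uint32_t crc, uint8_t b)
{
#ifdef HAVE_CRC32_INSN
    return _mm_crc32_u8(crc, b);
#else
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    return crc;
#endif
}

/*
 * Inlined path for small inputs.  With a compile-time constant length
 * such as 20, a compiler may be able to unroll this into 8 + 8 + 4.
 */
static inline uint32_t
crc32c_small_inline(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

#ifdef HAVE_CRC32_INSN
    for (; len >= 8; p += 8, len -= 8)
    {
        uint64_t v;

        memcpy(&v, p, 8);
        crc = (uint32_t) _mm_crc32_u64(crc, v);
    }
    if (len >= 4)
    {
        uint32_t v;

        memcpy(&v, p, 4);
        crc = _mm_crc32_u32(crc, v);
        p += 4;
        len -= 4;
    }
#endif
    while (len-- > 0)
        crc = crc32c_step_byte(crc, *p++);
    return crc;
}

/* Placeholder for the out-of-line version (e.g. one using PCLMUL). */
static uint32_t
crc32c_large_generic(uint32_t crc, const void *data, size_t len)
{
    return crc32c_small_inline(crc, data, len);
}

/* Function pointer that a runtime CPU check could redirect. */
static crc32c_fn crc32c_large = crc32c_large_generic;

/* Branch on length: inline for small inputs, indirect call otherwise. */
static inline uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    if (len <= SMALL_INPUT_MAX)
        return crc32c_small_inline(crc, data, len);
    return crc32c_large(crc, data, len);
}

/* Whole-buffer CRC-32C with the usual initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

A caller that passes a constant length of 20 takes the first branch, which is the case the disassembly above corresponds to.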

> This could potentially lead to a small
> regression for machines with SSE 4.2 but not PCLMUL, but that may be
> uncommon enough at this point to not worry about.

Note also that I mentioned upthread that we may have to go to 512-bit
pclmul, since Zen 2 regresses on 128-bit. :-(

I actually haven't seen any measurable difference with direct calls
versus indirect, but it could very well be that the microbenchmark is
hiding that since it's doing something unnatural by calling things a
bunch of times in a loop. I want to try changing the benchmark to base
the address it's computing on some bits from the crc from the last
loop iteration. I think that would make it more latency-sensitive. We
could also make it do an additional constant 20-byte input every time
to make it resemble WAL more closely.
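
A rough sketch of that benchmark idea, for illustration only: the offset of each input is derived from the low bits of the previous CRC, so that each iteration depends on the result of the one before it, and a constant 20-byte input is added every time. BUF_SIZE, HDR_LEN and bench_dependent() are hypothetical names, and the bitwise crc32c_update() merely stands in for whichever implementation is being measured.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 4096           /* hypothetical; must be a power of two */
#define HDR_LEN  20             /* constant input resembling a WAL header */

/* Stand-in CRC-32C (bit-at-a-time); swap in the function under test. */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return crc;
}

/*
 * buf must hold at least BUF_SIZE + input_len bytes (and at least
 * HDR_LEN bytes).  The address of each variable-length input depends on
 * the CRC from the previous iteration, so the CPU cannot start the next
 * computation before the previous one has finished.
 */
static uint32_t
bench_dependent(const unsigned char *buf, size_t input_len, int iters)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (int i = 0; i < iters; i++)
    {
        size_t off = crc & (BUF_SIZE - 1);  /* bits of the last result */

        crc = crc32c_update(crc, buf, HDR_LEN);         /* constant part */
        crc = crc32c_update(crc, buf + off, input_len); /* dependent part */
    }
    return crc;
}
```

Timing a call such as bench_dependent(buf, 64, 1000000) would then replace the current independent loop.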

> * For runtime-check SSE 4.2 builds, we choose slicing-by-8, SSE 4.2, or
> SSE 4.2 with PCLMUL, and we always use a function pointer.
>
> The main question I have is whether we can simplify this by always using a
> runtime check and by inlining slicing-by-8 for small inputs. That would be
> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
> for small inputs.

Slicing-by-8 needs one lookup and one XOR per byte of input, and other
overheads, so I think it would still be very slow.
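
To make the per-byte cost concrete, here is a minimal, generic slicing-by-8 sketch for CRC-32C. It is not PostgreSQL's actual code; sb8_table, sb8_init() and crc32c_sb8() are hypothetical names, and the tables are built at run time to keep the example short. Each 8-byte step does eight table lookups and XORs the results together.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t sb8_table[8][256];

/* Build the eight lookup tables; must be called once before use. */
static void
sb8_init(void)
{
    for (uint32_t n = 0; n < 256; n++)
    {
        uint32_t c = n;

        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0x82F63B78u & (0u - (c & 1)));
        sb8_table[0][n] = c;
    }
    /* Table t gives the CRC of a byte followed by t zero bytes. */
    for (uint32_t n = 0; n < 256; n++)
        for (int t = 1; t < 8; t++)
            sb8_table[t][n] = (sb8_table[t - 1][n] >> 8) ^
                sb8_table[0][sb8_table[t - 1][n] & 0xFF];
}

static uint32_t
crc32c_sb8(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len >= 8)
    {
        /* Assemble two little-endian words byte by byte for portability. */
        uint32_t lo = crc ^ ((uint32_t) p[0] | (uint32_t) p[1] << 8 |
                             (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24);
        uint32_t hi = (uint32_t) p[4] | (uint32_t) p[5] << 8 |
                      (uint32_t) p[6] << 16 | (uint32_t) p[7] << 24;

        /* One lookup per input byte, all XORed together. */
        crc = sb8_table[7][lo & 0xFF] ^
              sb8_table[6][(lo >> 8) & 0xFF] ^
              sb8_table[5][(lo >> 16) & 0xFF] ^
              sb8_table[4][lo >> 24] ^
              sb8_table[3][hi & 0xFF] ^
              sb8_table[2][(hi >> 8) & 0xFF] ^
              sb8_table[1][(hi >> 16) & 0xFF] ^
              sb8_table[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    /* Remaining bytes: plain byte-at-a-time table lookup. */
    while (len-- > 0)
        crc = (crc >> 8) ^ sb8_table[0][(crc ^ *p++) & 0xFF];
    return crc;
}
```

Compare this with the three crc32 instructions shown earlier for a 20-byte input: here the same input costs 20 lookups plus the byte shuffling around them.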

> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.

Yeah, it seems all ideas so far have something unattractive about them. :-(

--
John Naylor
Amazon Web Services
