Re: Improve CRC32C performance on SSE4.2

From:	John Naylor <johncnaylorls(at)gmail(dot)com>
To:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc:	"Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject:	Re: Improve CRC32C performance on SSE4.2
Date:	2025-03-05 01:51:21
Message-ID:	CANWCAZYRhLHArpyfV4uRK-Rw9N5oV5HMkkKtBehcuTjNOMwCZg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
> >> worry about.
> >
> > Note also upthread I mentioned we may have to go to 512-bit pclmul,
> > since Zen 2 regresses on 128-bit. :-(
>
> Ah, okay. You mean the AVX-512 version [0]?

Right, except not that version, rather a more efficient way and with
only one accumulator, so still a minimum length of 64 bytes. I'll
share that once we have agreement on detection/dispatch.

> And are you thinking we'd use
> the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
> SSE4.2 version for small inputs and use a function pointer for larger ones?

Yes, although we may not even have to inline for non-constant input;
see below. Inlining loops does take binary space.
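
For readers following along, here is a minimal sketch of the dispatch
strategy being discussed: small inputs go through an inlined loop, and
larger ones go through a function pointer. It is not the patch itself.
All names (crc32c_small, crc32c_large, SMALL_INPUT_MAX) are
hypothetical, the cutoff of 64 is only borrowed from the minimum length
mentioned above, and a portable bitwise loop stands in for both the
SSE4.2 and pclmul code paths so the example compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff: inputs shorter than this take the inlined path. */
#define SMALL_INPUT_MAX 64

/*
 * Portable bitwise CRC-32C update (reflected polynomial 0x82F63B78).
 * In a real build this would be a loop over the SSE4.2 crc32 instruction;
 * the bitwise version is a stand-in so the sketch needs no special flags.
 */
static inline uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Stand-in for a runtime-selected implementation for large inputs. */
static uint32_t
crc32c_large_fallback(uint32_t crc, const void *data, size_t len)
{
    return crc32c_small(crc, data, len);
}

/*
 * Function pointer for large inputs. The assumption here is that a real
 * build would point this at the best implementation after CPU detection.
 */
static uint32_t (*crc32c_large) (uint32_t, const void *, size_t) =
    crc32c_large_fallback;

/* Dispatch: inline the small case, call through the pointer otherwise. */
static inline uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    if (len < SMALL_INPUT_MAX)
        return crc32c_small(crc, data, len);
    return crc32c_large(crc, data, len);
}

/* Convenience wrapper applying the usual initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With this layout only the short loop is duplicated at each call site,
which is the binary-space cost mentioned above; the large-input code
exists once, behind the pointer.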

> > I actually haven't seen any measurable difference with direct calls
> > versus indirect, but it could very well be that the microbenchmark is
> > hiding that since it's doing something unnatural by calling things a
> > bunch of times in a loop. I want to try changing the benchmark to base
> > the address it's computing on some bits from the crc from the last
> > loop iteration. I think that would make it more latency-sensitive. We
> > could also make it do an additional constant 20-byte input every time
> > to make it resemble WAL more closely.
>
> Looking back on some old benchmarks for small-ish inputs [0], the
> difference does seem within the noise range. I suppose these functions
> might be expensive enough to make the function pointer overhead negligible.
> IME there's a big difference when a function pointer is used for an
> instruction or two [2], but even relatively small inputs to the CRC-32C
> functions might require several instructions.

That was my hunch too, but I wanted to be more sure, so I modified the
benchmark so that it doesn't know the address of the next calculation
until it finishes the previous one. That way we can hopefully see the
latency caused by indirection. It also does an additional calculation
on a constant 20 bytes, like the WAL header. I also tweaked the length
each iteration so the branch predictor maybe has a harder time
predicting the constant 20-byte input. And to make it more challenging,
I removed the part that inlined all small inputs, so it inlines only
constant inputs:

0001+0002 (test only)

func pointer:

32
latency average = 24.021 ms
latency average = 24.020 ms
latency average = 23.733 ms
40
latency average = 25.018 ms
latency average = 24.253 ms
latency average = 24.278 ms
48
latency average = 25.437 ms
latency average = 24.817 ms
latency average = 24.793 ms

SSE4.2 build (direct func):

32
latency average = 22.422 ms
latency average = 22.387 ms
latency average = 22.391 ms
40
latency average = 23.444 ms
latency average = 22.887 ms
latency average = 22.988 ms
48
latency average = 23.432 ms
latency average = 23.380 ms
latency average = 23.384 ms

0001-0006
SSE4.2 build (inlined constant / otherwise func pointer)

32
latency average = 22.135 ms
latency average = 21.874 ms
latency average = 21.910 ms
40
latency average = 22.916 ms
latency average = 23.086 ms
latency average = 22.422 ms
48
latency average = 23.255 ms
latency average = 22.780 ms
latency average = 22.804 ms

These are still a bit noisy, and close, but it seems there is no
penalty in using the function pointer as long as the header
calculation is inlined.
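
As a rough illustration of the benchmark change described above, the
loop below derives each input address from bits of the previous CRC,
varies the length a little every iteration, and adds a constant 20-byte
calculation each time. This is only a sketch and not the actual test
module: BUF_SIZE, HDR_LEN, fill_buffer, bench_dependent and the
"base_len plus 0..7" length tweak are all assumptions made for the
example, and a portable bitwise CRC-32C stands in for the real
implementations.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 4096           /* hypothetical buffer size, power of two */
#define HDR_LEN  20             /* constant-size input, like the WAL header */

/* Portable bitwise CRC-32C update, a stand-in for the code under test. */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Fill the buffer with arbitrary but deterministic bytes. */
static void
fill_buffer(unsigned char *buf)
{
    for (size_t i = 0; i < BUF_SIZE; i++)
        buf[i] = (unsigned char) (i * 31 + 7);
}

/*
 * Each iteration's input address depends on the previous CRC, so the next
 * calculation cannot start until the last one has finished. The caller
 * must keep base_len + 7 <= BUF_SIZE / 2 so reads stay inside the buffer.
 */
static uint32_t
bench_dependent(const unsigned char *buf, size_t base_len, int iterations)
{
    uint32_t    crc = 0xFFFFFFFFu;

    for (int i = 0; i < iterations; i++)
    {
        /* Offset comes from the low bits of the previous result. */
        size_t      offset = crc & (BUF_SIZE / 2 - 1);

        /* Vary the length slightly to make it harder to predict. */
        size_t      len = base_len + (size_t) (i & 7);

        crc = crc32c_update(crc, buf + offset, len);

        /* Extra calculation on a compile-time-constant length. */
        crc = crc32c_update(crc, buf + offset, HDR_LEN);
    }
    return crc;
}
```

Because every address depends on the previous result, independent
iterations cannot overlap, so any per-call overhead shows up in the
total time instead of being hidden by out-of-order execution.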

--
John Naylor
Amazon Web Services

In response to

Re: Improve CRC32C performance on SSE4.2 at 2025-03-04 17:36:09 from Nathan Bossart

Responses

Re: Improve CRC32C performance on SSE4.2 at 2025-03-05 15:52:22 from Nathan Bossart
