Re: Improve CRC32C performance on SSE4.2

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Improve CRC32C performance on SSE4.2
Date: 2025-02-18 06:40:13
Message-ID: CANWCAZaD5niydBF6q3V_cjApNV05cw-LpxxFtMbwDPLsz-PjbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:
>
> On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > I tried using branching for the runtime check, and this looks like the
> > way to go:
> > - Existing -msse4.2 builders will still call directly, but inside the
> > function there is a length check and only for long input will it do a
> > runtime check for pclmul.
> > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > inline calls with short constant input (e.g. WAL insert lock),
> > although I've not done that here.
> > - This can be a simple starting point for consolidating runtime
> > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > with branching my model was Andres' sketch here:
> >
> > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de
>
> Oh, nifty. IIUC this should help avoid quite a bit of overhead even before
> adding the PCLMUL stuff since it removes the function pointers for the
> runtime-check builds.

I figured the same, but it doesn't seem to help on the microbenchmark.
(I also tried the pg_waldump benchmark you used, but I couldn't get it
working so I'm probably missing a step.)

master:

20
latency average = 3.958 ms
latency average = 3.860 ms
latency average = 3.857 ms
32
latency average = 5.921 ms
latency average = 5.166 ms
latency average = 5.128 ms
48
latency average = 6.384 ms
latency average = 6.022 ms
latency average = 5.991 ms

v6:

20
latency average = 4.084 ms
latency average = 3.879 ms
latency average = 3.896 ms
32
latency average = 5.533 ms
latency average = 5.536 ms
latency average = 5.573 ms
48
latency average = 6.201 ms
latency average = 6.205 ms
latency average = 6.111 ms

> While this needn't block this patch set, I do find the dispatch code to be
> pretty complicated. Maybe we can improve that in the future by using
> macros to auto-generate much of it. My concern here is less about this
> particular patch set and more about the long term maintainability as we add
> more and more stuff like it, each with its own tangled web of build and
> dispatch rules.

I think the runtime dispatch is fairly simple for this case. To my
taste, the worse maintainability hazard is the building. To make that
better, I'd do this:
- Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build
them unconditionally for each platform
- Initialize the runtime info by CPU platform and not other symbols
where possible (I guess anything needing AVX-512 will still be a mess)
- Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which
intrinsic to use by platform and have a separate pg_crc32c_hw_impl.h
header that has the simple loop with these macros. That header would
for now only be included into pg_crc32c_hw.c.

The first two could be done as part of this patch or soon after. The
third is a bit more invasive, but seems like a necessary prerequisite
for inlining on small constant input, to keep that from being a mess.

There's another potential irritation for maintenance too: I spent too
much time only adding pg_cpucap_initialize() to frontend main()
functions that need it. I should have just added them to every binary.
We don't add new programs often, but it's still less automatic than
the function pointer way, so I'm open to other ideas.

Attached v7 to fix CI failures.
--
John Naylor
Amazon Web Services

Attachment Content-Type Size
v7-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch text/x-patch 6.4 KB
v7-0003-Improve-CRC32C-performance-on-x86_64.patch text/x-patch 8.6 KB
v7-0001-Dispatch-CRC-computation-by-branching-rather-than.patch text/x-patch 19.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2025-02-18 06:49:45 Re: ReplicationSlotRelease() crashes when the instance is in the single user mode
Previous Message Michael Paquier 2025-02-18 06:15:48 Re: ReplicationSlotRelease() crashes when the instance is in the single user mode