RE: Improve CRC32C performance on SSE4.2

From: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: RE: Improve CRC32C performance on SSE4.2
Date: 2025-02-18 17:46:38
Message-ID: PH8PR11MB8286FED80CBA0741A113292AFBFA2@PH8PR11MB8286.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now and so I will refrain from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me.

See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/

Raghuveer

> -----Original Message-----
> From: John Naylor <johncnaylorls(at)gmail(dot)com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
> Cc: Devulapalli, Raghuveer <raghuveer(dot)devulapalli(at)intel(dot)com>; pgsql-
> hackers(at)lists(dot)postgresql(dot)org; Shankaran, Akash <akash(dot)shankaran(at)intel(dot)com>
> Subject: Re: Improve CRC32C performance on SSE4.2
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart(at)gmail(dot)com>
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76on
> > > e%40awork3.anarazel.de
> >
> > Oh, nifty. IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
>
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it working so
> I'm probably missing a step.)
>
> master:
>
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
>
> v6:
>
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
>
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated. Maybe we can improve that in the future by
> > using macros to auto-generate much of it. My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
>
> I think the runtime dispatch is fairly simple for this case. To my taste, the worse
> maintainability hazard is the building. To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
>
> The first two could be done as part of this patch or soon after. The third is a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
>
> There's another potential irritation for maintenance too: I spent too much time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the function
> pointer way, so I'm open to other ideas.
>
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2025-02-18 18:04:03 moving some code out of explain.c
Previous Message Sergey Belyashov 2025-02-18 17:41:55 Re: BUG #18815: Logical replication worker Segmentation fault