RE: Improve CRC32C performance on SSE4.2

From: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: RE: Improve CRC32C performance on SSE4.2
Date: 2025-02-27 22:02:34
Message-ID: PH8PR11MB82862B464609340FB0A4BA37FBCD2@PH8PR11MB8286.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

Going back to your earlier comment:

> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)

> > Isn't that one thing currently pg_cpu_have(FEATURE)?
>
> No, I mean the one thing would be "do we have hardware CRC?", and each
> architecture would set (and check if needed) the same slot.
>
> I'm not 100% sure this is the better way, and there may be a disadvantage I
> haven't thought of, but this makes the most sense to me. I'm willing to hear
> reasons to do it differently, but I'm thinking those reasons need to be weighed
> against the loss of abstraction.

IMO, CPU SIMD (SSE4.2, AVX, etc.) features are a module of their own separate from capabilities/features supported in postgres (CRC32C, bitcount, etc.). Exposing architecture-specific details to the features that need to check against them is a way to make code more modular and reusable. It is reasonable to expect developers writing SIMD specific functions to naturally understand the runtime architecture requirements for that function which is easily accessible by just checking the value of pg_cpu_have(PG_CPU_FEATURE_..). If another feature in postgres requires SSE4.2, then the CPU initialization module doesn't need to be modified at all. They just have to worry about their feature and its CPU requirements.

> Just so we're on the same page, the current separate files do initialization, not
> dispatch. I'm seeing conflicting design goals
> here:
>
> 1. Have multiple similar dispatch functions with the only difference (for now)
> being certain names (not in the patch, just discussed).
> 2. Put initialization routines in the same file, even though they have literally
> nothing in common.

Sorry for not being clear. I will move the initialization routines to separate files. Just haven’t gotten to it yet with v10.

>
> > This also makes it easier to add more implementations in the future without
> having to make the dispatch function work for both ARM and x86.
>
> I don't quite understand, since with 0001 it already works (at least in theory) for 3
> architectures with hardware CRC, plus those without, and I don't see it changing
> with more architectures.

The reason its working across all architectures as of now is because there is only one runtime check for CRC32C feature across x86, ARM and LongArch. (PCLMUL dispatch happens inside the SSE42). If and when we add the AVX-512 implementation, won't the dispatch logic will look different for x86 and ARM? Along with that, the header file pg_crc32c.h will also look a lot messier with a whole bunch of nested #ifdefs. IMO, the header file should be devoid of any architecture dispatch logic and simply contain declarations of various implementations (see https://github.com/r-devulap/postgressql/blob/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/src/include/port/pg_crc32c.h for example). The dispatch logic should be handled separately in a C file.

>
> +/* Access to pg_cpucap for modules that need runtime CPUID information
> +*/ bool pg_cpu_have(int feature_id) {
> + return pg_cpucap[feature_id];
> }
>
> Putting this in a separate translation may have to do the decision to turn v8's
> #defines into an enum? In any case, this means doing a function call to decide
> which function to call. That makes no sense.

The reason it is a separate translational unit is because I intended pg_cpucap to be a read only variable outside of pg_cpucap.c. If the function overhead is not preferred, then I can get rid of it.

>
> The goal of v9/10 was to centralize the initialization from cpuid etc, but it also did
> a major re-architecture of the runtime-checks at the same. That made review
> more difficult.

I assumed we wanted to move the runtime checks to a central place and that needed this re-architecture.

>
> +#if defined(__x86_64__) || defined ...
> + pg_cpucap_x86();
> +#elif defined(__aarch64__) || defined ...
> + pg_cpucap_arm();
> +#endif
>
> I view this another conflict in design goals: After putting everything in the same
> file, the routines still have different names and have to be guarded accordingly. I
> don't really see the advantage, especially since these guard symbols don't even
> match the ones that guard the actual initialization steps. I tried to imply that in
> my last review, but maybe I should have been more explicit.
>
> I think the least painful step is to take the x86 initialization from v10, which is
> looking great, but
> - keep separate initialization files
> - don't whack around the runtime representation, at least not in the same patch

Sure. I can fix that in v11.

Raghuveer

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-02-27 22:04:06 Re: Should work_mem be stable for a prepared statement?
Previous Message Masahiko Sawada 2025-02-27 22:01:59 Re: Update docs for UUID data type