Re: Improve CRC32C performance on SSE4.2

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Improve CRC32C performance on SSE4.2
Date: 2025-02-26 05:22:25
Message-ID: CANWCAZbG_pPFuLRaczYApBykaHCXQ-STAXO1=J6_4TRvy23Byg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
<raghuveer(dot)devulapalli(at)intel(dot)com> wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid checks into one single place including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_have[FEATURE] defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in configure files which caused a build failure. Its fixed in v9.
>
> Pretty sure the ARM code paths need some correction. Let me know what you think.

Thanks, I think this is a good direction. Some comments:

+static void pg_cpuid(int leaf, int subleaf, unsigned int* exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}

Our current configure still looks for __get_cpuid and __cpuid. We
committed checking these new ones fairly recently, and they were
further gated by USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. It seems like
here we should do something like the following, where "+" lines are
from the patch and other lines are mine:

+static void
+pg_cpucap_x86(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#endif

+ pg_cpucap[PG_CPU_FEATURE_SSE42] = (exx[2] & (1 << 20)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_PCLMUL] = (exx[2] & (1 << 1)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_POPCNT] = (exx[2] & (1 << 23)) != 0;
+ /* osxsave */
+ if ((exx[2] & (1 << 27)) == 0) {
+ return;
+ }
+ /* avx512 os support */
+ if (zmm_regs_available()) {
+ return;
+ }

// BTW, I do like the gating here that reduces the number of places
that have to know about zmm and xsave. (Side note: shouldn't that be
"if !(zmm_regs_available())"?)

+ /* reset for second cpuid call on leaf 7 to check extended avx512
support */

exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID_COUNT)
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
__cpuidex(exx, 7, 0);
#endif

+ pg_cpucap[PG_CPU_FEATURE_AVX512F] = (exx[1] & (1 << 16)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_AVX512BW] = (exx[1] & (1 << 30)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_AVX512VPOPCNTDQ] = (exx[2] & (1 << 14)) != 0;
+
+}

What do you think?

-#define PGCPUCAP_INIT (1 << 0)
-#define PGCPUCAP_POPCNT (1 << 1)
-#define PGCPUCAP_VPOPCNT (1 << 2)
-#define PGCPUCAP_CRC32C (1 << 3)
-#define PGCPUCAP_CLMUL (1 << 4)
+enum pg_cpucap__
+{
+ PG_CPU_FEATURE_INIT = 0,
+ // X86
+ PG_CPU_FEATURE_SSE42 = 1,
+ PG_CPU_FEATURE_POPCNT = 2,
+ PG_CPU_FEATURE_PCLMUL = 3,
[...]
+ PG_CPU_FEATURE_ARMV8_CRC32C = 100,

I'm not a fan of exposing these architecture-specific details to
places that consult the capabilities. That requires things like

+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
[...]
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)

I'd prefer to have 1 capability <-> one place to check at runtime for
architectures that need that, and to keep architecture details private
to the initialization step. . Even for things that test for which
function pointer to use, I think it's a cleaner interface to look at
one thing.

+static void
+pg_cpucap_arch()
+{
+ /* WIP: configure checks */
+#ifdef __x86_64__
+ pg_cpucap_x86();
+#else // ARM:
+ pg_cpucap_arm();
+#endif
+}

If we're going to have a single file for the init step, we don't need
this -- we'd just have a different definition of
pg_cpucap_initialize() in each part, with a default that only adds the
"init" slot:

#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) ||
defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)

<cpuid checks>

#elif defined(__arm__) || defined(__arm) ||
defined(__aarch64__) || defined(_M_ARM64)

<for now only pg_crc32c_armv8_available()>

#else
<only init>
#endif

--

John Naylor
Amazon Web Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-02-26 05:27:44 Re: Fix logging for invalid recovery timeline
Previous Message Paul Jungwirth 2025-02-26 05:15:43 Re: SQL:2011 application time