Re: Changing default -march landscape

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Changing default -march landscape
Date: 2024-07-31 02:39:18
Message-ID: 20240731023918.ixsfbeuub6e76one@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Reply triggered by https://postgr.es/m/ZqmRYh3iikm1Kh3D%40nathan

On 2024-06-12 20:09:45 -0500, Nathan Bossart wrote:
> This is perhaps only tangentially related, but I've found it really
> difficult to avoid painting ourselves into a corner with this stuff. Let's
> use the SSE 4.2 CRC32C code as an example. Right now, if your default
> compiler flags indicate support for SSE 4.2 (which I'll note can be assumed
> with x86-64-v2), we'll use it unconditionally, no function pointer
> required. If additional compiler flags happen to convince the compiler to
> generate SSE 4.2 code, we'll instead build both a fallback version and the
> SSE version, and then we'll use a function pointer to direct to whatever we
> detect is available on the CPU when the server starts.

> Now, let's say we require x86-64-v2. Once we have that, we can avoid the
> function pointer on many more x86 machines. While that sounds great, now
> we have a different problem. If someone wants to add, say, AVX-512 support
> [0], which is a much newer instruction set, we'll need to use the function
> pointer again. And we're back where we started. We could instead just ask
> folks to compile with --march=native, but then these optimizations are only
> available for a subset of users until we decide the instructions are
> standard enough 20 years from now.

I don't think this is that big an issue:

So much of the avx512 stuff is only available on an almost arbitrary seeming
set of platforms, so we'll need runtime tests approximately forever. Which
also means we need a fairly standard pattern for deciding whether the runtime
dispatch is worth the cost. That's pretty independent of whether we e.g. want
to implement pg_popcount64 without runtime dispatch, it'd not make sense to
use avx512 for that anyway.

> The idea that's been floating around recently is to build a bunch of
> different versions of Postgres and to choose one on startup based on what
> the CPU supports. That seems like quite a lot of work, and it'll increase
> the size of the builds quite a bit, but it at least doesn't have the
> aforementioned problem.

I still think that's a good idea - but I don't think it's gonna deal well with
the various avx512 "features". The landscape of what's supported on what CPU
is so comically complicated that there's no way to build separate versions for
everything.

We can hide most of the dispatch cost in a static inline function that only
does the runtime test if size is large enough - the size is a compile time
constant most of the time, which optimizes away the dispatch cost most of the
time. And even if not, an inlined runtime branch is a *lot* cheaper than an
indirect function call.

I think this should be something *roughly* like this:

/* shared between all cpu-type dependent code */

#define PGCPUCAP_INIT (1 << 0)
#define PGCPUCAP_POPCNT (1 << 1)
#define PGCPUCAP_VPOPCNT (1 << 2)
#define PGCPUCAP_CRC32C (1 << 3)
...

extern uint32 pg_cpucap; /* possibly an bool array would be better */
extern void pg_cpucap_initialize(void);

static inline int
pg_popcount64(uint64 word)
{
Assert(pg_cpucap & PGCPUCAP_INIT);

#if defined(HAVE_POPCNT64_COMPILETIME) || defined(HAVE_POPCNT64_RUNTIME)

#if defined(HAVE_POPCNT64_RUNTIME)
if (pg_cpucap & PGCPUCAP_POPCNT64)
#endif
{
return pg_popcount64_fast(buf, bytes);
}
/* fall through if not available */
#else
return pg_popcount64_slow(word);
#endif
}

/* badly guessed */
#define AVX512_THRESHOLD 64

static inline uint64
pg_popcount(const char *buf, int bytes)
{
uint64 popcnt = 0;

Assert(pg_cpucap & PGCPUCAP_INIT);

/*
* Most of the times `bytes` will be a compile time constant and the
* branches below can be optimized out. Even if they can't, a branch or
* three here is a lot cheaper than an indirect function call.
*/

#if defined(HAVE_AVX512_VPOPCNT_COMPILETIME) || defined(HAVE_AVX512_VPOPCNT_RUNTIME)
if (unlikely(bytes >= AVX512_THRESHOLD))
{
#if defined(HAVE_AVX512_VPOPCNT_RUNTIME)
if (pg_cpucap & PGCPUCAP_VPOPCNT)
#else
{
return pg_popcount_avx512(buf, bytes);
}
/* if not available we fall through to the unrolled implementation */
#endif
}
#endif /* HAVE_AVX512_VPOPCNT_* */

/* XXX: should probably be implemented in separate function */
if (bytes > 8)
{
while (bytes >= 8)
{
uint64 word;

/*
* Address could be unaligned, compiler will optimize this to a
* plain [unaligned] load on most architectures.
*/
memcpy(&word, buf, sizeof(uint64));

/*
* TODO: if compilers can't hoist the pg_cpucap check
* automatically, we should do so "manually".
*/
popcnt += pg_popcount64(word);

buf += sizeof(uint64);
bytes -= sizeof(uint64);
}
}

/*
* Handle tail, we can use the 64bit version by just loading the relevant
* portion of the data into a wider register.
*/
if (bytes > 0)
{
uint64 word = 0;

Assert(bytes < 8);

memcpy(&word, buf, bytes);

popcnt += pg_popcount64_fast(word);
}

return popcnt;
}

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-07-31 02:43:08 Re: Popcount optimization using AVX512
Previous Message Japin Li 2024-07-31 02:36:11 Remove unnecessary forward declaration for heapam_methods variable