Re: CRC32C Parallel Computation Optimization on ARM

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Xiang Gao <Xiang(dot)Gao(at)arm(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: CRC32C Parallel Computation Optimization on ARM
Date: 2024-12-12 02:30:59
Message-ID: CANWCAZbO46fMgK1K5Tk24HLh9dc8cwFnK1v1Q=dxqLkfweO9ig@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:

> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice. I'm curious how this compares to both the existing implementations
> and the proposed ones that require new intrinsics. I like the idea of
> avoiding new runtime and config checks, especially if the performance is
> somewhat comparable for the most popular cases (i.e., dozens of bytes to a
> few thousand bytes).

With 8k inputs on x86 it's fairly close to 3x faster than master.

I wasn't very clear, but v9 still has a cutoff of 1008 bytes, which I
simply copied from 0008. On a slightly old machine the crossover point
is about 400-600 bytes. Microbenchmarks that hammer on single
instructions are very finicky, so I don't trust these numbers much.

With hardware CLMUL, I'm guessing the cutoff would be between 120 and
192 bytes (it must be a multiple of 24 -- 3 words), and would depend on
the architecture. Arm has the advantage that vmull_p64() operates on
scalars, but on x86 the corresponding operation is
_mm_clmulepi64_si128(), and there's a bit of shuffling in and out of
vector registers.
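
To make the "multiple of 24" constraint concrete, here is a minimal sketch of how a length could be split between the interleaved loop and the ordinary path. It assumes that "3 words" means three 8-byte words, one per interleaved stream. The names crc_split, crc_split_len and the CRC_* macros are hypothetical and do not come from any patch in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants: three streams, one 8-byte word each per step. */
#define CRC_WORD_BYTES  8
#define CRC_STREAMS     3
#define CRC_BLOCK_BYTES (CRC_WORD_BYTES * CRC_STREAMS)	/* 24 */

typedef struct
{
	size_t		bulk;		/* bytes for the interleaved loop, multiple of 24 */
	size_t		tail;		/* bytes left for the ordinary path */
} crc_split;

/*
 * Split len into a bulk part and a tail.  Inputs shorter than the cutoff
 * skip the interleaved loop entirely.  The cutoff itself must be a
 * multiple of 24 so that the smallest bulk input is a whole number of
 * three-word steps.
 */
static crc_split
crc_split_len(size_t len, size_t cutoff)
{
	crc_split	s;

	assert(cutoff % CRC_BLOCK_BYTES == 0);

	if (len < cutoff)
	{
		s.bulk = 0;
		s.tail = len;
	}
	else
	{
		s.tail = len % CRC_BLOCK_BYTES;
		s.bulk = len - s.tail;
	}
	return s;
}
```

With a cutoff of 144, for example, an 8192-byte input would give 8184 bytes to the interleaved loop and 8 bytes to the tail, while a 100-byte input would go entirely to the ordinary path.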

> If we still want to add new intrinsics, would it be easy enough to add them
> on top of this patch? Or would it require further restructuring?

I'm still trying to wrap my head around how function selection works
after commit 4b03a27fafc, but it could be something like this on x86:

#if defined(__has_attribute) && __has_attribute (target)

pg_attribute_target("sse4.2,pclmul")
pg_comp_crc32c_sse42
{
<big loop with special case for end>
<hardware carryless multiply>
<tail>
}

#endif

pg_attribute_target("sse4.2")
pg_comp_crc32c_sse42
{
<big loop>
<software carryless multiply>
<tail>
}

...where we have the tail part in a separate function for readability.
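
For readers unfamiliar with the "<software carryless multiply>" step, here is a minimal, portable sketch of the operation itself: multiplication of two polynomials over GF(2), where partial products are combined with XOR instead of addition. This is the plain bit-by-bit textbook form, not the routine used in the patch (which is presumably faster), and the name clmul32_soft is hypothetical.

```c
#include <stdint.h>

/*
 * Carryless multiply of two 32-bit values, giving a result of up to
 * 63 bits.  For every set bit i of b, XOR in a shifted left by i.
 * No carries propagate between bit positions.
 */
static uint64_t
clmul32_soft(uint32_t a, uint32_t b)
{
	uint64_t	result = 0;

	for (int i = 0; i < 32; i++)
	{
		if ((b >> i) & 1)
			result ^= (uint64_t) a << i;
	}
	return result;
}
```

As a sanity check, clmul32_soft(3, 3) is 5 rather than 9, since (x + 1)^2 = x^2 + 1 over GF(2). The hardware instructions mentioned above compute the same kind of product on 64-bit operands in a single operation.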

On Arm it might have to be as complex as in 0008, since as you've
mentioned, compiler support for the needed attributes is still pretty
new.

--
John Naylor
Amazon Web Services
