From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | Xiang Gao <Xiang(dot)Gao(at)arm(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: CRC32C Parallel Computation Optimization on ARM |
Date: | 2024-12-12 02:30:59 |
Message-ID: | CANWCAZbO46fMgK1K5Tk24HLh9dc8cwFnK1v1Q=dxqLkfweO9ig@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice. I'm curious how this compares to both the existing implementations
> and the proposed ones that require new intrinsics. I like the idea of
> avoiding new runtime and config checks, especially if the performance is
> somewhat comparable for the most popular cases (i.e., dozens of bytes to a
> few thousand bytes).
With 8k inputs on x86 it's fairly close to 3x faster than master.
I wasn't very clear, but v9 still has a cutoff of 1008 bytes just to
copy from 0008, but on a slightly old machine the crossover point is
about 400-600 bytes. Doing microbenchmarks that hammer on single
instructions is very finicky, so I don't trust these numbers much.
With hardware CLMUL, I'm guessing cutoff would be between 120 and 192
bytes (must be a multiple of 24 -- 3 words), and would depend on
architecture. Arm has an advantage that vmull_p64() operates on
scalars, but on x86 the corresponding operation is
_mm_clmulepi64_si128(), and there's a bit of shuffling in and out of
vector registers.
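To make the operation concrete: a carryless multiply is long multiplication with XOR in place of addition, so no carries propagate between bit positions. Below is a minimal portable sketch of that definition. The name clmul32_soft is made up for illustration; it is not the software fallback from the patch, and a bit-at-a-time loop like this would be far too slow for real use.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): carryless multiply of two
 * 32-bit values, giving a product of up to 63 significant bits.
 */
static uint64_t
clmul32_soft(uint32_t a, uint32_t b)
{
    uint64_t    result = 0;

    for (int i = 0; i < 32; i++)
    {
        /* For each set bit in b, XOR in a shifted copy of a. */
        if ((b >> i) & 1)
            result ^= (uint64_t) a << i;
    }
    return result;
}
```

vmull_p64() and _mm_clmulepi64_si128() compute the same kind of product in hardware on 64-bit operands, producing a 128-bit result.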
> If we still want to add new intrinsics, would it be easy enough to add them
> on top of this patch? Or would it require further restructuring?
I'm still trying to wrap my head around how function selection works
after commit 4b03a27fafc, but it could be something like this on x86:
#if defined(__has_attribute) && __has_attribute (target)
pg_attribute_target("sse4.2,pclmul")
pg_comp_crc32c_sse42
{
    <big loop with special case for end>
    <hardware carryless multiply>
    <tail>
}
#endif

pg_attribute_target("sse4.2")
pg_comp_crc32c_sse42
{
    <big loop>
    <software carryless multiply>
    <tail>
}
...where we have the tail part in a separate function for readability.
On Arm it might have to be as complex as in 0008, since as you've
mentioned, compiler support for the needed attributes is still pretty
new.
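As a rough illustration of the "big loop, then tail in a separate function" shape, here is a minimal sketch. Everything in it is an assumption made for illustration: the names crc32c_tail and crc32c_sketch, the BLOCK_BYTES constant (taken from the "multiple of 24 -- 3 words" remark), and above all the bulk loop, which is only a placeholder that reuses the bitwise routine. A real version would run interleaved CRC streams there and combine them with a carryless multiply.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 24          /* 3 words of 8 bytes; assumed from the text */

/* Bit-at-a-time CRC-32C update (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
    }
    return crc;
}

/* Hypothetical entry point: whole 24-byte blocks first, then the tail. */
static uint32_t
crc32c_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
    size_t      bulk = len - len % BLOCK_BYTES;

    /*
     * Placeholder for the big loop. It only walks the input in whole
     * blocks; it does not do any parallel computation.
     */
    for (size_t off = 0; off < bulk; off += BLOCK_BYTES)
        crc = crc32c_tail(crc, p + off, BLOCK_BYTES);

    /* The remaining 0..23 bytes go through the separate tail function. */
    return crc32c_tail(crc, p + bulk, len - bulk);
}
```

The caller is assumed to start from 0xFFFFFFFF and invert the result at the end, as is usual for CRC-32C.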
--
John Naylor
Amazon Web Services