RE: Improve CRC32C performance on SSE4.2

From: "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: RE: Improve CRC32C performance on SSE4.2
Date: 2025-02-11 21:34:46
Message-ID: PH8PR11MB828604A88A2E6C5EB5FF71E1FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com
Lists: pgsql-hackers

Hello,

Attached is v3, which is the same as v2 plus a runtime CPUID check for PCLMULQDQ.

> > I ran the same benchmark drive_crc32c with the postgres infrastructure and
> > found that your v2 sse42 version from corsix is slower than
> > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
>
> That matches my findings as well.

Never mind: I was building with the Makefile, which doesn’t seem to add any optimization flag by default. After switching to meson, which uses -O2, and benchmarking with pgbench (using your script), this behavior goes away on my TGL (Tiger Lake) machine. Here is what I measure with your v2 (and v3):

| bytes | master (ms) | sse4.2-v2 (ms) | ratio |
|    64 |       9.627 |          6.306 |  1.52 |
|    80 |      10.976 |          6.662 |  1.64 |
|    96 |      12.411 |          8.212 |  1.51 |
|   112 |      13.871 |          9.403 |  1.47 |
|   128 |      15.283 |          7.724 |  1.97 |
|   144 |      16.715 |          9.173 |  1.82 |
|   160 |      18.18  |         11.292 |  1.60 |
|   176 |      19.847 |         12.606 |  1.57 |
|   192 |      22.043 |         10.16  |  2.16 |
|   208 |      24.261 |         11.699 |  2.07 |
|   224 |      26.63  |         13.607 |  1.95 |
|   240 |      28.994 |         14.721 |  1.96 |
|   256 |      31.418 |         13.132 |  2.39 |


> On my machine that still regresses compared to master in that range (although by
> not as much) so I still think 128 bytes is the right threshold.

On my TGL, buffer sizes as small as 64 bytes see performance benefits.
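
To make the threshold discussion concrete, here is a minimal sketch of length-based dispatch. It is not code from the patch: CRC32C_LARGE_THRESHOLD, use_pclmul_path, crc32c_dispatch and the two stub routines are hypothetical names, and a portable bitwise CRC-32C stands in for both the existing SSE4.2 loop and the new PCLMULQDQ routine so that the example builds anywhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff; the thread debates whether 64 or 128 bytes is right. */
#define CRC32C_LARGE_THRESHOLD 128

/*
 * Portable bitwise CRC-32C (reflected polynomial 0x82F63B78).
 * Operates on the raw CRC state: the caller initializes it to 0xFFFFFFFF
 * and applies the final XOR with 0xFFFFFFFF.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Stand-in for the existing per-word SSE4.2 loop (assumption). */
static uint32_t
crc32c_small_stub(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Stand-in for the new PCLMULQDQ-based routine (assumption). */
static uint32_t
crc32c_large_stub(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Take the vectorized path only when the CPU supports it and the
 * buffer is at least the threshold length. */
static bool
use_pclmul_path(size_t len, bool have_pclmul)
{
    return have_pclmul && len >= CRC32C_LARGE_THRESHOLD;
}

static uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len, bool have_pclmul)
{
    if (use_pclmul_path(len, have_pclmul))
        return crc32c_large_stub(crc, data, len);
    return crc32c_small_stub(crc, data, len);
}
```

Lowering CRC32C_LARGE_THRESHOLD to 64 would match the range where the table above shows a gain on TGL; keeping it at 128 is the more conservative choice for machines that still regress below that size.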

> The effect of -O3 with gcc14.2 is that the single-block loop (after the 4-block loop)
> is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that
> even for systems that build with -O3.

Agreed. My perf data shows -O2 is just as good.

> Okay, Nehalem is 17 years old, and the additional cpuid check would still work on
> hardware 14-15 years old, so I think it's fine to bump the requirement for runtime
> hardware support.

Sounds good. I updated the runtime check to include PCLMULQDQ. The new algorithm will run only on Westmere and newer CPUs.
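
For readers following along, a runtime check of this kind could look roughly like the sketch below. It is an illustration, not the code in the attached patch: the function names are hypothetical, and it assumes the GCC/Clang `<cpuid.h>` helper `__get_cpuid`. The bit positions are the CPUID leaf 1 ECX flags for PCLMULQDQ (bit 1) and SSE4.2 (bit 20).

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang; MSVC would use __cpuid from <intrin.h> */
#endif

/* CPUID leaf 1, ECX feature bits. */
#define CPUID_ECX_PCLMULQDQ (1u << 1)
#define CPUID_ECX_SSE42     (1u << 20)

/* Pure helper: true only if both feature bits are set in the given ECX value. */
static bool
ecx_has_sse42_and_pclmul(uint32_t ecx)
{
    const uint32_t need = CPUID_ECX_SSE42 | CPUID_ECX_PCLMULQDQ;

    return (ecx & need) == need;
}

/* Hypothetical runtime check; returns false on non-x86 builds. */
static bool
cpu_supports_sse42_pclmul(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx_has_sse42_and_pclmul(ecx);
#else
    return false;
#endif
}
```

Requiring both bits is what excludes Nehalem, which has SSE4.2 but not PCLMULQDQ.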

Raghuveer

| Attachment | Content-Type | Size |
| v3-0001-Add-more-test-coverage-for-crc32c.patch | application/octet-stream | 3.4 KB |
| v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch | application/octet-stream | 6.4 KB |
| v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch | application/octet-stream | 10.8 KB |
| v3-0004-Shorter-version-from-corsix.patch | application/octet-stream | 7.7 KB |
