From: | "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com> |
Subject: | RE: Improve CRC32C performance on SSE4.2 |
Date: | 2025-02-11 21:34:46 |
Message-ID: | PH8PR11MB828604A88A2E6C5EB5FF71E1FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello,
Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check.
> > I ran the same benchmark drive_crc32c with the postgres infrastructure and
> found that your v2 sse42 version from corsix is slower than
> pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
>
> That matches my findings as well.
Never mind, I was building using the Makefile which doesn’t seem to add any optimization flag by default. I switched to using meson which uses -O2 and benchmarked using pgbench (using your script) and this behavior goes away on my TGL. Here is what I measure with your v2 (and v3):
| bytes | master (ms) | sse4.2-v2 (ms) | ratio |
| 64 | 9.627 | 6.306 | 1.52 |
| 80 | 10.976 | 6.662 | 1.64 |
| 96 | 12.411 | 8.212 | 1.51 |
| 112 | 13.871 | 9.403 | 1.47 |
| 128 | 15.283 | 7.724 | 1.97 |
| 144 | 16.715 | 9.173 | 1.82 |
| 160 | 18.18 | 11.292 | 1.60 |
| 176 | 19.847 | 12.606 | 1.57 |
| 192 | 22.043 | 10.16 | 2.16 |
| 208 | 24.261 | 11.699 | 2.07 |
| 224 | 26.63 | 13.607 | 1.95 |
| 240 | 28.994 | 14.721 | 1.96 |
| 256 | 31.418 | 13.132 | 2.39 |
> On my machine that still regresses compared to master in that range (although by
> not as much) so I still think 128 bytes is the right threshold.
On my TGL, buffer sizes as small as 64 bytes see performance benefits.
> The effect of -O3 with gcc14.2 is that the single-block loop (after the 4-block loop)
> is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that
> even for systems that build with -O3.
Agreed. My perf data shows -O2 is just as good.
> Okay, Nehalem is 17 years old, and the additional cpuid check would still work on
> hardware 14-15 years old, so I think it's fine to bump the requirement for runtime
> hardware support.
Sounds good. I updated the runtime check to include PCLMULQDQ. New algorithm will run only on Westmere and newer CPU.
Raghuveer
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Add-more-test-coverage-for-crc32c.patch | application/octet-stream | 3.4 KB |
v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch | application/octet-stream | 6.4 KB |
v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch | application/octet-stream | 10.8 KB |
v3-0004-Shorter-version-from-corsix.patch | application/octet-stream | 7.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Matthias van de Meent | 2025-02-11 21:40:41 | Re: Expanding HOT updates for expression and partial indexes |
Previous Message | Peter Smith | 2025-02-11 21:23:37 | Re: Introduce XID age and inactive timeout based replication slot invalidation |