From: | "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com> |
---|---|
To: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com> |
Subject: | RE: Improve CRC32C performance on SSE4.2 |
Date: | 2025-02-11 00:24:45 |
Message-ID: | PH8PR11MB8286F844321BA1DEEC518348FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi John,
> I'm highly suspicious of these numbers because they show this version
> is about 20x faster than "scalar", so relatively speaking 3x faster
> than the AVX-512 proposal?
Apologies for this. I was suspicious of this too and looks like I had unintentionally set the scalar version I wrote for testing CRC32C correctness (which computes crc one byte at time). The code for microbenchmarking is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:
Comparing scalar_crc32c to sse42_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.0972 -0.0971 5 4 5 4
[scalar_crc32c vs. sse42_crc32c]/128 -0.3048 -0.3048 8 6 8 6
[scalar_crc32c vs. sse42_crc32c]/256 -0.4610 -0.4610 19 10 19 10
[scalar_crc32c vs. sse42_crc32c]/512 -0.6432 -0.6432 50 18 50 18
[scalar_crc32c vs. sse42_crc32c]/1024 -0.7192 -0.7192 121 34 121 34
[scalar_crc32c vs. sse42_crc32c]/2048 -0.7275 -0.7276 259 70 259 70
> Luckily, that's easily fixable: It turns out the implementation
> in the paper (and chromium) has a very inefficient finalization step,
> using more carryless multiplications and plenty of other operations.
> After the main loop, and at the end, it's much more efficient to
> convert the 128-bit intermediate result directly into a CRC in the
> usual way.
Thank you for pointing this out and also fixing it! This improves over the chromium version by 10-25% especially for with smaller byte size 64 - 512 bytes:
Comparing sse42_crc32c to corsix_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[sse42_crc32c vs. corsix_crc32c]/64 -0.2696 -0.2696 4 3 4 3
[sse42_crc32c vs. corsix_crc32c]/128 -0.1551 -0.1552 6 5 6 5
[sse42_crc32c vs. corsix_crc32c]/256 -0.1787 -0.1787 10 8 10 8
[sse42_crc32c vs. corsix_crc32c]/512 -0.1351 -0.1351 18 15 18 15
[sse42_crc32c vs. corsix_crc32c]/1024 -0.0972 -0.0972 34 31 34 31
[sse42_crc32c vs. corsix_crc32c]/2048 -0.0763 -0.0763 69 64 69 64
OVERALL_GEOMEAN -0.1544 -0.1544 0 0 0 0
> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.
I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. I think the reason is that postgres is not using -O3 flag build the crc32c source files and the compiler generates less than optimal code. Adding that flag fixes the regression for buffers with 64 bytes - 128 bytes. Could you confirm that behavior on your end too?
---
src/port/pg_crc32c_sse42.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index a8c1e5609b..a350b1b93a 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
pg_attribute_no_sanitize_alignment()
+__attribute__((optimize("-O3")))
pg_attribute_target("sse4.2,pclmul")
pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)
--
You could also just build with export CFLAGS="-O3" instead of adding the function attribute.
> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
May I ask which processor does you older machine have? I am benchmarking on a Tigerlake processor.
> It's probably okay to fold these together in the same compile-time
> check, since both are fairly old by now, but for those following
> along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> cause machines building on Nehalem (2008) to fail the check and go
> back to slicing-by-8 with it written this way.
Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to crash with segill on Nehalam (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pcmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
Raghuveer
From: Devulapalli, Raghuveer <raghuveer(dot)devulapalli(at)intel(dot)com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Shankaran, Akash <akash(dot)shankaran(at)intel(dot)com>; Devulapalli, Raghuveer <raghuveer(dot)devulapalli(at)intel(dot)com>
Subject: Improve CRC32C performance on SSE4.2
This patch improves the performance of SSE42 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processing 64 bytes at time. The algorithm is based on sse42 version of crc32 computation from Chromium's copy of zlib with modified constants for crc32c computation. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):
Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.8147 -0.8148 33 6 33 6
[scalar_crc32c vs. sse42_crc32c]/128 -0.8962 -0.8962 88 9 88 9
[scalar_crc32c vs. sse42_crc32c]/256 -0.9200 -0.9200 211 17 211 17
[scalar_crc32c vs. sse42_crc32c]/512 -0.9389 -0.9389 486 30 486 30
[scalar_crc32c vs. sse42_crc32c]/1024 -0.9452 -0.9452 1037 57 1037 57
[scalar_crc32c vs. sse42_crc32c]/2048 -0.9456 -0.9456 2140 116 2140 116
Raghuveer
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2025-02-11 00:25:42 | Re: describe special values in GUC descriptions more consistently |
Previous Message | Sami Imseih | 2025-02-11 00:14:14 | Re: Proposal to CREATE FOREIGN TABLE LIKE |