Proposal for Updating CRC32C with AVX-512 Algorithm.

From: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Proposal for Updating CRC32C with AVX-512 Algorithm.
Date: 2024-05-01 15:56:08
Message-ID: BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com
Lists: pgsql-hackers

Hi,

Comparing the current SSE4.2 implementation of the CRC32C algorithm in Postgres to an optimized AVX-512 algorithm [0], we observed significant gains. For buffers of 256 bytes or more, the AVX-512 version was ~6.6X faster on average, measured on 3 different Intel products. Details are below. The AVX-512 algorithm in C is a port of the ISA-L library [1] assembler code.

Workload call size distribution details (write heavy):
* Average was approximately 1,010 bytes per call
* ~80% of the calls were under 256 bytes
* ~20% of the calls were 256 bytes or larger, up to the maximum buffer size of 8192 bytes

The 256-byte threshold is important because, if the buffer is smaller, it makes sense to fall back to the existing implementation: the AVX-512 algorithm needs a minimum of 256 bytes to operate.
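The fallback described above could look roughly like the following sketch. It is not the proposed patch: all function names are hypothetical, and a portable bitwise CRC32C stands in for both the existing SSE 4.2 path and the AVX-512 path so the example stays self-contained. Only the 256-byte threshold comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimum buffer size for the AVX-512 path, per the text above. */
#define CRC32C_AVX512_MIN_LEN 256

/* Portable bitwise CRC32C (reflected Castagnoli polynomial 0x82F63B78). */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc;
}

/* Hypothetical stand-in for the existing (SSE 4.2) implementation. */
static uint32_t
crc32c_existing(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Hypothetical stand-in for the AVX-512 implementation. */
static uint32_t
crc32c_avx512(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Returns 1 when the buffer is large enough for the AVX-512 path. */
int
crc32c_uses_avx512(size_t len)
{
    return len >= CRC32C_AVX512_MIN_LEN;
}

/* Short buffers fall back to the existing code; the rest go wide. */
uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
    if (!crc32c_uses_avx512(len))
        return crc32c_existing(crc, data, len);
    return crc32c_avx512(crc, data, len);
}
```

In this sketch the caller seeds crc with 0xFFFFFFFF and XORs the result with 0xFFFFFFFF, the usual CRC32C convention.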

Using the above workload data distribution:
* At 0% of calls < 256 bytes, an 841% improvement on average for crc32c functionality was observed.
* At 50% of calls < 256 bytes, a 758% improvement on average for crc32c functionality was observed.
* At 90% of calls < 256 bytes, a 44% improvement on average for crc32c functionality was observed.
* At 97.6% of calls < 256 bytes, the workload's crc32c performance breaks even.
* At 100% of calls < 256 bytes, a 14% regression is seen when using the AVX-512 implementation.
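One way to sanity-check the mix-dependent figures is a simple linear blend of the two per-size multipliers from the summary table (0.86 below 256 bytes, 6.62 at or above). This is an assumed model, not a description of how the numbers were measured, and the function names are hypothetical. It reproduces the 90%, break-even, and 100% lines; it is not claimed to reproduce the 0% and 50% figures.

```c
/*
 * Assumed linear model: blended multiplier when a fraction small_frac
 * of calls is below 256 bytes and the rest are 256 bytes or larger.
 */
double
blended_multiplier(double small_frac, double m_small, double m_large)
{
    return small_frac * m_small + (1.0 - small_frac) * m_large;
}

/*
 * Fraction of small calls at which the blended multiplier equals 1.0,
 * obtained by solving f * m_small + (1 - f) * m_large = 1 for f.
 */
double
break_even_small_fraction(double m_small, double m_large)
{
    return (m_large - 1.0) / (m_large - m_small);
}
```

With m_small = 0.86 and m_large = 6.62, a 90% share of small calls gives roughly 1.44, and the break-even share comes out at roughly 97.6%.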

The results above are averages over 3 machines, measured on Intel Sapphire Rapids bare metal and on two AWS EC2 instances: Intel Sapphire Rapids (m7i.2xlarge) and Intel Ice Lake (m6i.2xlarge).

Summary Data (Sapphire Rapids bare metal, AWS m6i-2xl, and AWS m7i-2xl):
+---------------------+-------------------+-------------------+-------------------+--------------------+
| Rates in Bytes/us   |    Bare Metal     |    AWS m6i-2xl    |    AWS m7i-2xl    |                    |
| (Larger is Better)  +---------+---------+---------+---------+---------+---------+ Overall Multiplier |
|                     | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 |                    |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 256-8192    |  12,046 |  83,196 |   7,471 |  39,965 |  11,867 |  84,589 |               6.62 |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 64 - 255    |  16,865 |  15,909 |   9,209 |   7,363 |  12,496 |  10,046 |               0.86 |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
                                                    | Weighted Multiplier [*]     |               1.44 |
                                                    +-----------------------------+--------------------+
Perf data showed no evidence of AVX-512 frequency throttling; the frequency stayed steady during the test.
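For readers checking the table: the Overall Multiplier column appears consistent with dividing the summed AVX-512 rates by the summed SSE 4.2 rates across the three machines. That derivation is an inference from the numbers rather than something stated here, and the identifiers below are hypothetical. A minimal sketch:

```c
#include <stddef.h>

/* Rates in bytes/us from the table: bare metal, m6i-2xl, m7i-2xl. */
const double SSE_LARGE[3] = {12046.0, 7471.0, 11867.0};
const double AVX_LARGE[3] = {83196.0, 39965.0, 84589.0};
const double SSE_SMALL[3] = {16865.0, 9209.0, 12496.0};
const double AVX_SMALL[3] = {15909.0, 7363.0, 10046.0};

/* Assumed derivation: ratio of summed AVX-512 to summed SSE 4.2 rates. */
double
overall_multiplier(const double *sse, const double *avx, size_t n)
{
    double sse_sum = 0.0;
    double avx_sum = 0.0;

    for (size_t i = 0; i < n; i++)
    {
        sse_sum += sse[i];
        avx_sum += avx[i];
    }
    return avx_sum / sse_sum;
}
```

Under this assumption the 256-8192 row gives 207,750 / 31,384, or about 6.62, and the 64-255 row gives 33,318 / 38,570, or about 0.86.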

Feedback on this proposed improvement is appreciated. Some questions:
1) This AVX-512 ISA-L derived code uses the BSD 3-Clause license [2]. Is this compatible with the PostgreSQL License [3]? Both appear to be very permissive licenses, but I am not an expert on licenses.
2) Is there a preferred benchmark I should run to test this change?

If licensing is a non-issue, I can post the initial patch along with my Postgres benchmark function patch for further review.

Thanks,
Paul

[0] https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
[1] https://github.com/intel/isa-l
[2] https://opensource.org/license/bsd-3-clause
[3] https://opensource.org/license/postgresql

[*] Weights used were 90% of requests less than 256 bytes, 10% greater than or equal to 256 bytes.
