Popcount optimization using SVE for ARM

From: "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Rajat(dot)Ma(at)fujitsu(dot)com" <Rajat(dot)Ma(at)fujitsu(dot)com>
Subject: Popcount optimization using SVE for ARM
Date: 2024-12-06 05:54:15
Message-ID: OSZPR01MB84990A9A02A3515C6E85A65B8B2A2@OSZPR01MB8499.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hello,

This email proposes a speed-up of the popcount and popcount-mask functions on the ARM architecture, which we have implemented using SVE intrinsics.
The current popcount implementation on ARM relies on compiler intrinsics or plain C code, both of which process data in a scalar fashion, one integer at a time. With SVE intrinsics, popcount can process multiple integers simultaneously (how many depends on the vector length), which significantly improves performance.
We have designed this feature to ensure compatibility and robustness. It includes compile-time and runtime checks for SVE compatibility with both the compiler and hardware. If either check fails, the code falls back to the existing scalar implementation, ensuring fail-safe operation. Additionally, we leveraged the existing infrastructure to select between different popcount implementations, avoiding additional complexity.
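To illustrate the compile-time and runtime checks described above, here is a minimal sketch of function-pointer dispatch with a scalar fallback. It is not taken from the patch: the names popcount_bytes, popcount_choose, popcount_sve, popcount_scalar, sve_available and the macro HAVE_SVE_POPCOUNT are hypothetical, and the runtime check assumes Linux, where getauxval(AT_HWCAP) reports HWCAP_SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time check: only build the SVE path if the compiler targets SVE. */
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#define HAVE_SVE_POPCOUNT 1		/* hypothetical macro, for this sketch only */
#endif

#if defined(HAVE_SVE_POPCOUNT) && defined(__linux__)
#include <sys/auxv.h>			/* getauxval, AT_HWCAP */
#ifndef HWCAP_SVE
#define HWCAP_SVE (1 << 22)		/* arm64 Linux hwcap bit for SVE */
#endif
#endif

/* Scalar fallback: one byte at a time (GCC/Clang builtin). */
static uint64_t
popcount_scalar(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes--)
		count += (uint64_t) __builtin_popcount((unsigned char) *buf++);
	return count;
}

#ifdef HAVE_SVE_POPCOUNT
/* SVE path: a predicated loop handles any length, including the tail. */
static uint64_t
popcount_sve(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < bytes; i += svcntb())
	{
		svbool_t	pg = svwhilelt_b8_u64(i, bytes);
		svuint8_t	v = svld1_u8(pg, (const uint8_t *) buf + i);

		count += svaddv_u8(pg, svcnt_u8_x(pg, v));
	}
	return count;
}
#endif

/* Runtime check: does the CPU we are running on actually have SVE? */
static int
sve_available(void)
{
#if defined(HAVE_SVE_POPCOUNT) && defined(__linux__)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return 0;
#endif
}

static uint64_t popcount_choose(const char *buf, size_t bytes);

/* Starts at the chooser; the first call replaces it with the real routine. */
static uint64_t (*popcount_impl) (const char *, size_t) = popcount_choose;

static uint64_t
popcount_choose(const char *buf, size_t bytes)
{
	popcount_impl = popcount_scalar;
#ifdef HAVE_SVE_POPCOUNT
	if (sve_available())
		popcount_impl = popcount_sve;
#endif
	return popcount_impl(buf, bytes);
}

uint64_t
popcount_bytes(const char *buf, size_t bytes)
{
	assert(buf != NULL || bytes == 0);
	return popcount_impl(buf, bytes);
}
```

If either check fails, popcount_impl ends up pointing at the scalar routine, so callers of popcount_bytes never need to know which path was selected.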

Algorithm Overview:
1. For larger inputs, align the buffer to avoid double loads. For smaller inputs, alignment is not necessary and might even degrade performance.
2. Process the aligned buffer chunk by chunk, up to the last incomplete chunk.
3. Process the last incomplete chunk.
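
The three steps above can be sketched in portable C as follows. This is only an illustration of the control flow, not the patch itself: it substitutes 64-bit words for SVE vectors so that it builds on any GCC or Clang target, and the names popcount_chunked, CHUNK_BYTES and ALIGN_THRESHOLD (including the cut-off value of 64 bytes) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES		sizeof(uint64_t)	/* stand-in for the SVE vector length */
#define ALIGN_THRESHOLD	64					/* hypothetical cut-off, not from the patch */

static uint64_t
popcount_chunked(const char *buf, size_t bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	count = 0;

	assert(buf != NULL || bytes == 0);

	/* Step 1: for larger inputs only, consume bytes up to a chunk boundary. */
	if (bytes >= ALIGN_THRESHOLD)
	{
		while (((uintptr_t) p % CHUNK_BYTES) != 0)
		{
			count += (uint64_t) __builtin_popcount(*p++);
			bytes--;
		}
	}

	/* Step 2: process the buffer one full chunk at a time. */
	while (bytes >= CHUNK_BYTES)
	{
		uint64_t	word;

		memcpy(&word, p, CHUNK_BYTES);	/* also safe when p is unaligned */
		count += (uint64_t) __builtin_popcountll(word);
		p += CHUNK_BYTES;
		bytes -= CHUNK_BYTES;
	}

	/* Step 3: process the last incomplete chunk byte by byte. */
	while (bytes > 0)
	{
		count += (uint64_t) __builtin_popcount(*p++);
		bytes--;
	}

	return count;
}
```

In an SVE version, steps 1 and 3 would more likely use predicated loads than byte loops, and the chunk size would be the hardware vector length rather than a fixed 8 bytes.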

Our setup:
Machine: AWS EC2 c7g.8xlarge - 32 vCPUs, 64 GB RAM
OS: Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We used John Naylor's popcount-test-module [0] for benchmarking and observed a speed-up of more than 3x for larger buffers. For small inputs of 8 and 32 bytes, we observed no performance degradation.

We would like to contribute this work so that it is available to the community. To do so, we are following the procedure described in Submitting a Patch - PostgreSQL wiki<https://wiki.postgresql.org/wiki/Submitting_a_Patch>. The patch and the performance results are attached.
Please let us know if you have any queries or suggestions.

Thanks & Regards,
Susmitha Devanga.

Attachment Content-Type Size
SVE_support_for_popcount.patch application/octet-stream 45.7 KB
image/png 64.5 KB
benchmarking-2.png image/png 78.2 KB
