From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
---|---|
To: | "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Rajat(dot)Ma(at)fujitsu(dot)com" <Rajat(dot)Ma(at)fujitsu(dot)com> |
Subject: | Re: Popcount optimization using SVE for ARM |
Date: | 2024-12-06 07:22:19 |
Message-ID: | CALdSSPj+A2wDKzuqeVR7UheFGemrXcE7pkRAkNh3YHgs6m3Nqw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, 6 Dec 2024 at 10:54, Devanga(dot)Susmitha(at)fujitsu(dot)com <
Devanga(dot)Susmitha(at)fujitsu(dot)com> wrote:
> Hello,
> This email proposes contributing a speed-up of the popcount and popcount
> mask functions that we have developed for the ARM architecture using SVE
> intrinsics.
> The current method for popcount on ARM relies on compiler intrinsics or C
> code, which processes data in a scalar fashion, handling one integer at a
> time. By leveraging SVE intrinsics for popcount, the execution can process
> multiple integers simultaneously, depending on the vector length, thereby
> significantly enhancing the performance of the functionality.
> We have designed this feature to ensure compatibility and robustness. It
> includes compile-time and runtime checks for SVE compatibility with both
> the compiler and hardware. If either check fails, the code falls back to
> the existing scalar implementation, ensuring fail-safe operation.
> Additionally, we leveraged the existing infrastructure to select between
> different popcount implementations, avoiding additional complexity.
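
The fallback scheme described above could look roughly like the sketch below. It is not taken from the patch: the names popcount_impl, popcount_choose and popcount_sve_placeholder are hypothetical, and the "SVE" routine is only a placeholder that calls the scalar code so that the example builds on any machine. The runtime check assumes Linux on AArch64, where getauxval(AT_HWCAP) reports the HWCAP_SVE bit; on any other platform the sketch always selects the scalar path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)	/* value used by the Linux arm64 headers */
#endif
#endif

typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t len);

/* Scalar fallback: count set bits one byte at a time. */
static uint64_t
popcount_scalar(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		for (unsigned char b = buf[i]; b != 0; b &= (unsigned char) (b - 1))
			total++;			/* each iteration clears the lowest set bit */
	return total;
}

/* Placeholder (hypothetical): a real build would use SVE intrinsics here. */
static uint64_t
popcount_sve_placeholder(const unsigned char *buf, size_t len)
{
	return popcount_scalar(buf, len);
}

/* Runtime check: does the hardware (and kernel) expose SVE? */
static int
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return 0;
#endif
}

static uint64_t popcount_choose(const unsigned char *buf, size_t len);

/* Starts out pointing at the chooser, which replaces it on first use. */
static popcount_fn popcount_impl = popcount_choose;

static uint64_t
popcount_choose(const unsigned char *buf, size_t len)
{
	popcount_impl = sve_available() ? popcount_sve_placeholder : popcount_scalar;
	return popcount_impl(buf, len);
}
```

Callers always go through popcount_impl(buf, len); the first call pays for the feature check and later calls jump straight to the selected routine.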
>
> *Algorithm Overview:*
> 1. For larger inputs, align the buffer to avoid double loads. For smaller
> inputs, alignment is unnecessary and might even degrade performance.
> 2. Process the aligned buffer chunk by chunk until only the last
> incomplete chunk remains.
> 3. Process the last incomplete chunk.
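
The three steps above can be sketched in portable C as follows. This is only an illustration of the control flow, not the patch itself: an 8-byte word stands in for an SVE vector, CHUNK and ALIGN_THRESHOLD are hypothetical values, and popcount_chunked is a made-up name. In an SVE version the chunk size would come from the hardware vector length and the tail would typically be handled with a predicate rather than a byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK sizeof(uint64_t)	/* stand-in for the vector length in bytes */
#define ALIGN_THRESHOLD 64		/* hypothetical cutoff, not from the patch */

/* Count set bits in a 64-bit word by clearing the lowest set bit each time. */
static uint64_t
popcount_word(uint64_t w)
{
	uint64_t	n = 0;

	for (; w != 0; w &= w - 1)
		n++;
	return n;
}

/* Byte-at-a-time helper used for the unaligned head and the tail. */
static uint64_t
popcount_bytes(const unsigned char *p, size_t n)
{
	uint64_t	total = 0;

	for (size_t i = 0; i < n; i++)
		total += popcount_word(p[i]);
	return total;
}

uint64_t
popcount_chunked(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	assert(buf != NULL || len == 0);

	/* Step 1: only for larger inputs, consume bytes up to a CHUNK boundary. */
	if (len >= ALIGN_THRESHOLD)
	{
		size_t		head = (size_t) (-(uintptr_t) buf & (CHUNK - 1));

		total += popcount_bytes(buf, head);
		buf += head;
		len -= head;
	}

	/* Step 2: process full chunks. */
	while (len >= CHUNK)
	{
		uint64_t	w;

		memcpy(&w, buf, CHUNK);	/* also safe when step 1 was skipped */
		total += popcount_word(w);
		buf += CHUNK;
		len -= CHUNK;
	}

	/* Step 3: process the last incomplete chunk. */
	total += popcount_bytes(buf, len);
	return total;
}
```

Skipping the alignment step for short buffers mirrors the observation in step 1 that aligning small inputs costs more than it saves.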
> *Our setup:*
> Machine: AWS EC2 c7g.8xlarge - 32 vCPUs, 64 GB RAM
> OS: Ubuntu 22.04.5 LTS
> GCC: 11.4
>
> *Benchmark and Result:*
> We used John Naylor's popcount-test-module [0] for benchmarking and
> observed a speed-up of more than 3x for larger buffers. Even for smaller
> inputs of 8 and 32 bytes, we observed no performance degradation.
>
> We would like to contribute this work so that it is available for the
> community to use. To do so, we are following the procedure described in
> Submitting a Patch - PostgreSQL wiki
> <https://wiki.postgresql.org/wiki/Submitting_a_Patch>. *Please find the
> attachments for the patch and performance results.*
> Please let us know if you have any queries or suggestions.
>
>
> Thanks & Regards,
> Susmitha Devanga.
>
Hi! Is this patch somehow related to [0]?
--
Best regards,
Kirill Reshke