Re: Popcount optimization using SVE for ARM

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, "Rajat(dot)Ma(at)fujitsu(dot)com" <Rajat(dot)Ma(at)fujitsu(dot)com>
Subject: Re: Popcount optimization using SVE for ARM
Date: 2024-12-06 07:22:19
Message-ID: CALdSSPj+A2wDKzuqeVR7UheFGemrXcE7pkRAkNh3YHgs6m3Nqw@mail.gmail.com
Lists: pgsql-hackers

On Fri, 6 Dec 2024 at 10:54, Devanga(dot)Susmitha(at)fujitsu(dot)com <
Devanga(dot)Susmitha(at)fujitsu(dot)com> wrote:

> Hello, this email is to discuss contributing the popcount and popcount
> mask speed-up we have developed for the ARM architecture using SVE
> intrinsics.
> The current method for popcount on ARM relies on compiler intrinsics or C
> code, which processes data in a scalar fashion, handling one integer at a
> time. By leveraging SVE intrinsics for popcount, the execution can process
> multiple integers simultaneously, depending on the vector length, thereby
> significantly enhancing the performance of the functionality.
> We have designed this feature to ensure compatibility and robustness. It
> includes compile-time and runtime checks for SVE compatibility with both
> the compiler and hardware. If either check fails, the code falls back to
> the existing scalar implementation, ensuring fail-safe operation.
> Additionally, we leveraged the existing infrastructure to select between
> different popcount implementations, avoiding additional complexity.
>
> *Algorithm Overview:*
> 1. For larger inputs, align the buffers to avoid double loads. For smaller
> inputs alignment is not necessary and might even degrade the performance.
> 2. Process the aligned buffer chunk by chunk till the last incomplete
> chunk.
> 3. Process the last incomplete chunk.
>
> *Our setup:*
> Machine: AWS EC2 c7g.8xlarge - 32 vCPUs, 64 GB RAM
> OS: Ubuntu 22.04.5 LTS
> GCC: 11.4
>
> *Benchmark and Result:*
> We have used John Naylor's popcount-test-module [0] for benchmarking and
> observed a speed-up of more than 3x for larger buffers. Even for smaller
> inputs of 8 and 32 bytes, no performance degradation was observed.
>
>
>
> We would like to contribute our above work so that it can be available for
> the community to utilize. To do so, we are following the procedure
> mentioned in Submitting a Patch - PostgreSQL wiki
> <https://wiki.postgresql.org/wiki/Submitting_a_Patch>. *Please find the
> attachments for the patch and performance results.*
> Please let us know if you have any queries or suggestions.
>
>
> Thanks & Regards,
> Susmitha Devanga.
>
Hi! Is this patch somehow related to [0]?

[0]
https://www.postgresql.org/message-id/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com

--
Best regards,
Kirill Reshke
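
For readers following the thread, the three steps in the quoted "Algorithm
Overview" (align for larger inputs, process whole chunks, process the last
incomplete chunk) can be sketched in portable C as below. This is not the
code from the patch: it swaps the SVE vector loop for a scalar 64-bit word
loop so that it compiles anywhere. The names popcount_chunked_sketch,
CHUNK_BYTES and ALIGN_THRESHOLD are hypothetical, and the 64-byte threshold
is an assumption, since the message does not say where "larger inputs"
begin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: one 64-bit word stands in for one SVE vector. */
#define CHUNK_BYTES     sizeof(uint64_t)
/* Assumption: inputs below this size skip the alignment step. */
#define ALIGN_THRESHOLD 64

/* Count set bits in one byte by clearing the lowest set bit each pass. */
static inline unsigned popcount_byte(unsigned char b)
{
    unsigned n = 0;

    while (b)
    {
        b &= (unsigned char) (b - 1);
        n++;
    }
    return n;
}

/* Same technique for a 64-bit word. */
static inline unsigned popcount_word(uint64_t w)
{
    unsigned n = 0;

    while (w)
    {
        w &= w - 1;
        n++;
    }
    return n;
}

uint64_t popcount_chunked_sketch(const char *buf, size_t bytes)
{
    const unsigned char *p = (const unsigned char *) buf;
    uint64_t count = 0;

    assert(buf != NULL || bytes == 0);

    /* Step 1: for larger inputs only, consume bytes up to a chunk boundary. */
    if (bytes >= ALIGN_THRESHOLD)
    {
        while (bytes > 0 && ((uintptr_t) p % CHUNK_BYTES) != 0)
        {
            count += popcount_byte(*p++);
            bytes--;
        }
    }

    /* Step 2: process whole chunks. */
    while (bytes >= CHUNK_BYTES)
    {
        uint64_t w;

        memcpy(&w, p, CHUNK_BYTES); /* safe even when p is unaligned */
        count += popcount_word(w);
        p += CHUNK_BYTES;
        bytes -= CHUNK_BYTES;
    }

    /* Step 3: process the last incomplete chunk byte by byte. */
    while (bytes > 0)
    {
        count += popcount_byte(*p++);
        bytes--;
    }

    return count;
}
```

In an SVE version, step 2 would presumably count bits across a full vector
per iteration and step 3 would use a predicate to mask off the bytes past
the end of the buffer, but those details belong to the attached patch and
are not reproduced here.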
