Re: [PATCH] SVE popcount support

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>
Cc: "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
Subject: Re: [PATCH] SVE popcount support
Date: 2025-03-12 18:32:07
Message-ID: Z9HTJ7VUlORFod6T@nathan
Lists: pgsql-hackers

On Wed, Mar 12, 2025 at 10:34:46AM +0000, Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com wrote:
> On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandbossart(at)gmail(dot)com wrote:
>
>> v5-no-sve is the result of using a function pointer, but pointing to the
>> "slow" versions instead of the SVE version. v5-sve is the result of the
>> latest patch in this thread on a machine with SVE support, and v5-4reg is
>> the result of the latest patch in this thread modified to process 4
>> registers' worth of data at a time.
>
> Nice, I wonder why I did not observe any performance gain in the 4reg
> version. Did you modify the 4reg version code?
>
> One possible explanation is that you used Graviton4 based instances
> whereas I used Graviton3 instances.

Yeah, it looks like the number of vector registers is different [0].

>> For the latter point, I think we should consider trying to add a separate
>> Neon implementation that we use as a fallback for machines that don't have
>> SVE. My understanding is that Neon is virtually universally supported on
>> 64-bit Arm gear, so that will not only help offset the function pointer
>> overhead but may even improve performance for a much wider set of machines.
>
> I have added the NEON implementation in the latest patch.
>
> Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
> Scalar - 692ms
> Neon - 298ms
> SVE - 112ms

Those are nice results. I'm a little worried about the Neon implementation
for smaller inputs since it uses a per-byte loop for the remaining bytes,
though. If we can ensure there's no regression there, I think this patch
will be in decent shape.
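
To illustrate the concern about the remaining bytes, here is a minimal, portable sketch. It is not the code from the patch. The function names are hypothetical, and it assumes a compiler that provides the GCC/Clang builtins __builtin_popcount and __builtin_popcountll. The idea is that whatever is left after the vector loop can be consumed in 8-byte words first, so that at most 7 bytes ever reach the per-byte loop:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Reference version: counts set bits one byte at a time.  This is the
 * shape of loop that could be slow if many bytes are left over.
 * (Hypothetical name, not from the patch.)
 */
static uint64_t
popcount_bytewise(const char *buf, size_t bytes)
{
    uint64_t    popcnt = 0;

    while (bytes--)
        popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);

    return popcnt;
}

/*
 * Tail handling sketch: consume the buffer in 8-byte words, then fall
 * back to single bytes for the final 0-7 bytes.  memcpy avoids any
 * alignment assumptions about buf.
 * (Hypothetical name, not from the patch.)
 */
static uint64_t
popcount_tail_words(const char *buf, size_t bytes)
{
    uint64_t    popcnt = 0;

    while (bytes >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        popcnt += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        bytes -= sizeof(word);
    }

    /* At most 7 bytes remain here. */
    while (bytes--)
        popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);

    return popcnt;
}
```

Both functions return the same count for any input; the second only bounds how much work the per-byte loop can do. Whether this matters in practice for small inputs would still need to be measured.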

[0] https://github.com/aws/aws-graviton-getting-started?tab=readme-ov-file#building-for-graviton

--
nathan
