Re: [PATCH] SVE popcount support

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>
Cc: "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
Subject: Re: [PATCH] SVE popcount support
Date: 2025-02-05 16:11:05
Message-ID: Z6ONmQVSD5Qnpbsl@nathan
Lists: pgsql-hackers

On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com wrote:
>> + /*
>> + * For smaller inputs, aligning the buffer degrades the performance.
>> + * Therefore, align the buffers only when the input size is sufficiently large.
>> + */
>
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs? I'm also curious what level of performance degradation you
>> were seeing.
>
> Here is a comparison of all three cases. Alignment is marginally better for inputs
> above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
> Aligned After 128B => the current implementation: "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => the condition "bytes > 4 * vec_len" is removed
> Unaligned => the whole if block is removed
>
> buf (bytes) | Always Aligned | Aligned After 128B | Unaligned
> ------------+----------------+--------------------+----------
>          16 |         37.851 |             38.203 |    34.971
>          32 |         37.859 |             38.187 |    34.972
>          64 |         37.611 |             37.405 |    34.121
>         128 |         45.357 |             45.897 |    41.890
>         256 |         62.440 |             63.454 |    58.666
>         512 |        100.120 |            102.767 |    99.861
>        1024 |        159.574 |            158.594 |   164.975
>        2048 |        282.354 |            281.198 |   283.937
>        4096 |        532.038 |            531.068 |   533.699
>        8192 |       1038.973 |           1038.083 |  1039.206
>       16384 |       2028.604 |           2025.843 |  2033.940

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason not
to do that?
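
For anyone following along, the three cases being compared can be sketched
in portable scalar C. This is only an illustration under stated
assumptions: it uses 8-byte words and the GCC/Clang builtins
__builtin_popcount and __builtin_popcountll rather than SVE intrinsics,
and the function names and the threshold parameter are hypothetical, not
taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count set bits one byte at a time. */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t bytes)
{
    uint64_t    count = 0;

    assert(buf != NULL || bytes == 0);
    for (size_t i = 0; i < bytes; i++)
        count += (uint64_t) __builtin_popcount(buf[i]);
    return count;
}

/* "Unaligned" case: no alignment prologue, word loop starts at buf. */
static uint64_t
popcount_unaligned(const unsigned char *buf, size_t bytes)
{
    uint64_t    count = 0;
    size_t      i = 0;

    for (; i + sizeof(uint64_t) <= bytes; i += sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf + i, sizeof(word));   /* safe unaligned load */
        count += (uint64_t) __builtin_popcountll(word);
    }
    return count + popcount_bytes(buf + i, bytes - i);
}

/*
 * "Aligned" cases: consume bytes up to the next 8-byte boundary first,
 * but only when the input is larger than threshold (assumed parameter).
 */
static uint64_t
popcount_aligned(const unsigned char *buf, size_t bytes, size_t threshold)
{
    uint64_t    count = 0;
    size_t      head = 0;
    size_t      misalign = (size_t) ((uintptr_t) buf % sizeof(uint64_t));

    if (misalign != 0 && bytes > threshold)
    {
        head = sizeof(uint64_t) - misalign;
        if (head > bytes)
            head = bytes;
        count += popcount_bytes(buf, head);
    }
    return count + popcount_unaligned(buf + head, bytes - head);
}
```

In this sketch, threshold = 0 plays the role of "Always Aligned", a nonzero
threshold plays the role of "Aligned After", and popcount_unaligned is what
remains if the if block is dropped. All three return the same count; only
the addresses used by the word loop differ.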

+    /* Process 2 complete vectors */
+    for (; i < loop_bytes; i += vec_len * 2)
+    {
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+        accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+        accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+    }

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?
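
To make the comparison concrete, here is a rough scalar sketch of the three
loop shapes in question (no unrolling, 2 accumulators, 4 accumulators). It
is an assumption-laden stand-in, not the patch: it processes 8-byte words
with the GCC/Clang builtin __builtin_popcountll instead of SVE vectors, and
all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD sizeof(uint64_t)

/* Load one 8-byte word (alignment-safe) and return its popcount. */
static inline uint64_t
load_popcount(const unsigned char *p)
{
    uint64_t    word;

    memcpy(&word, p, sizeof(word));
    return (uint64_t) __builtin_popcountll(word);
}

/* Shared tail: whole words from offset i, then any leftover bytes. */
static uint64_t
popcount_tail(const unsigned char *buf, size_t i, size_t bytes)
{
    uint64_t    count = 0;

    assert(i <= bytes);
    for (; i + WORD <= bytes; i += WORD)
        count += load_popcount(buf + i);
    for (; i < bytes; i++)
        count += (uint64_t) __builtin_popcount(buf[i]);
    return count;
}

/* No unrolling: one word per iteration, one accumulator. */
static uint64_t
popcount_x1(const unsigned char *buf, size_t bytes)
{
    return popcount_tail(buf, 0, bytes);
}

/* Two independent accumulators, same shape as the quoted loop. */
static uint64_t
popcount_x2(const unsigned char *buf, size_t bytes)
{
    uint64_t    accum1 = 0, accum2 = 0;
    size_t      i = 0;
    size_t      loop_bytes = bytes - bytes % (2 * WORD);

    for (; i < loop_bytes; i += 2 * WORD)
    {
        accum1 += load_popcount(buf + i);
        accum2 += load_popcount(buf + i + WORD);
    }
    return accum1 + accum2 + popcount_tail(buf, i, bytes);
}

/* Four independent accumulators. */
static uint64_t
popcount_x4(const unsigned char *buf, size_t bytes)
{
    uint64_t    a1 = 0, a2 = 0, a3 = 0, a4 = 0;
    size_t      i = 0;
    size_t      loop_bytes = bytes - bytes % (4 * WORD);

    for (; i < loop_bytes; i += 4 * WORD)
    {
        a1 += load_popcount(buf + i);
        a2 += load_popcount(buf + i + WORD);
        a3 += load_popcount(buf + i + 2 * WORD);
        a4 += load_popcount(buf + i + 3 * WORD);
    }
    return a1 + a2 + a3 + a4 + popcount_tail(buf, i, bytes);
}
```

The variants differ only in how many independent dependency chains the main
loop carries; whether that helps on a given core is something only
measurements can show.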

--
nathan
