| From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> | 
|---|---|
| To: | "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com> | 
| Cc: | "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com> | 
| Subject: | Re: [PATCH] SVE popcount support | 
| Date: | 2025-02-05 16:11:05 | 
| Message-ID: | Z6ONmQVSD5Qnpbsl@nathan | 
| Lists: | pgsql-hackers | 
On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com wrote:
>> +    /*
>> +     * For smaller inputs, aligning the buffer degrades the performance.
>> +     * Therefore, the buffer is aligned only when the input size is sufficiently large.
>> +     */
> 
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs?  I'm also curious what level of performance degradation you
>> were seeing.
> 
> Here is a comparison of all three cases. Alignment is marginally better for inputs
> above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => condition "bytes > 4 * vec_len" is removed.
> Unaligned => the whole if block was removed
> 
>  buf   | Always Aligned | Aligned After 128B | Unaligned
> -------+----------------+--------------------+----------
>     16 |         37.851 |             38.203 |    34.971
>     32 |         37.859 |             38.187 |    34.972
>     64 |         37.611 |             37.405 |    34.121
>    128 |         45.357 |             45.897 |    41.890
>    256 |         62.440 |             63.454 |    58.666
>    512 |        100.120 |            102.767 |    99.861
>   1024 |        159.574 |            158.594 |   164.975
>   2048 |        282.354 |            281.198 |   283.937
>   4096 |        532.038 |            531.068 |   533.699
>   8192 |       1038.973 |           1038.083 |  1039.206
>  16384 |       2028.604 |           2025.843 |  2033.940

Hm.  These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment.  Is there any reason not
to do that?
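
For anyone following along, the three variants in the table differ only in
when the unaligned head of the buffer is peeled off before the main loop.
Here is a rough, portable sketch of that control flow.  It is not the SVE
code from the patch: VEC_LEN is assumed to be 32 bytes (so that 4 * VEC_LEN
lines up with the 128B cutoff), the scalar loops stand in for the vector
loop, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte vectors, so 4 * VEC_LEN matches the 128B cutoff. */
#define VEC_LEN 32

/* Byte-at-a-time popcount, used for the unaligned head and the tail. */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < bytes; i++)
		for (unsigned char b = buf[i]; b != 0; b &= (unsigned char) (b - 1))
			count++;
	return count;
}

/* Scalar stand-in for the vector loop: 8 bytes per step, then the tail. */
static uint64_t
popcount_body(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;
	size_t		i = 0;

	for (; i + sizeof(uint64_t) <= bytes; i += sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf + i, sizeof(word));	/* safe for any alignment */
		for (; word != 0; word &= word - 1)
			count++;
	}
	return count + popcount_bytes(buf + i, bytes - i);
}

/*
 * threshold = 0            -> "Always Aligned"
 * threshold = 4 * VEC_LEN  -> "Aligned After 128B"
 * threshold = SIZE_MAX     -> "Unaligned" (the head is never peeled off)
 */
uint64_t
popcount_with_threshold(const unsigned char *buf, size_t bytes, size_t threshold)
{
	uint64_t	count = 0;
	size_t		misalign = (size_t) ((uintptr_t) buf % VEC_LEN);

	assert(buf != NULL || bytes == 0);

	if (misalign != 0 && bytes > threshold)
	{
		size_t		head = VEC_LEN - misalign;

		if (head > bytes)
			head = bytes;		/* short input: everything is "head" */
		count += popcount_bytes(buf, head);
		buf += head;
		bytes -= head;
	}
	return count + popcount_body(buf, bytes);
}
```

All three settings return the same count; only the amount of work done
before the main loop changes, which is why the extra branch and head loop
can show up as overhead on small inputs.
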
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}

Does this hand-rolled loop unrolling offer any particular advantage?  What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?
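
To make the question concrete, the comparison I have in mind looks roughly
like the following.  This is a portable scalar sketch, not SVE: each
uint64 word stands in for one vector, popcount64 is a plain SWAR bit count,
and the popcount_unroll* names are hypothetical.  The only thing that
varies between the three functions is the number of independent
accumulators per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SWAR popcount of one 64-bit word (stand-in for a vector popcount). */
static inline uint64_t
popcount64(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

/* No unrolling: one word and one accumulator per iteration. */
uint64_t
popcount_unroll1(const uint64_t *words, size_t nwords)
{
	uint64_t	accum = 0;

	assert(words != NULL || nwords == 0);
	for (size_t i = 0; i < nwords; i++)
		accum += popcount64(words[i]);
	return accum;
}

/* Two words per iteration, two independent accumulators. */
uint64_t
popcount_unroll2(const uint64_t *words, size_t nwords)
{
	uint64_t	accum1 = 0;
	uint64_t	accum2 = 0;
	size_t		i = 0;

	assert(words != NULL || nwords == 0);
	for (; i + 2 <= nwords; i += 2)
	{
		accum1 += popcount64(words[i]);
		accum2 += popcount64(words[i + 1]);
	}
	for (; i < nwords; i++)		/* leftover word, if any */
		accum1 += popcount64(words[i]);
	return accum1 + accum2;
}

/* Four words per iteration, four independent accumulators. */
uint64_t
popcount_unroll4(const uint64_t *words, size_t nwords)
{
	uint64_t	accum[4] = {0, 0, 0, 0};
	size_t		i = 0;

	assert(words != NULL || nwords == 0);
	for (; i + 4 <= nwords; i += 4)
	{
		accum[0] += popcount64(words[i]);
		accum[1] += popcount64(words[i + 1]);
		accum[2] += popcount64(words[i + 2]);
		accum[3] += popcount64(words[i + 3]);
	}
	for (; i < nwords; i++)		/* up to three leftover words */
		accum[0] += popcount64(words[i]);
	return accum[0] + accum[1] + accum[2] + accum[3];
}
```

Timing the three against the same buffer sizes as in the table above would
show whether the extra accumulators pay for themselves.  Whether the
compiler already unrolls the simple loop on its own is a separate question
that this sketch does not answer.
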
-- 
nathan