From: | "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com> |
Subject: | Re: [PATCH] SVE popcount support |
Date: | 2025-02-06 08:44:35 |
Message-ID: | TY2PR01MB26673A2C028501C981E84CD697F62@TY2PR01MB2667.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?
It seems that the double-load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove the alignment section to
simplify the code.
> Does this hand-rolled loop unrolling offer any particular advantage? What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?
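The version unrolled to two vectors per iteration performs better than the
non-unrolled one (close to 2x for buffers of 1024 bytes and larger), but
processing four vectors provides no additional benefit over two. The numbers
and the code used for the four-vector loop are given below.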
 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961
/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
    vec64_1 = svld1(pred, (const uint64 *) (buf + i));
    accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
    vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
    accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));
    vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
    accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
    vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
    accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}
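
For anyone without SVE hardware who wants to see the shape of the unrolling,
here is a rough portable sketch of the two-accumulator idea. It is not the
patch code: popcount_unrolled_x2 and load_u64 are hypothetical names, a
64-bit word stands in for an SVE vector, and it assumes a GCC/Clang compiler
for __builtin_popcountll. The point is only that each accumulator forms its
own dependency chain, and that no alignment prologue is needed because the
loads tolerate unaligned addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes from a possibly unaligned address. */
static inline uint64_t
load_u64(const char *p)
{
    uint64_t    v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Hypothetical scalar stand-in for the unrolled loop: two words per
 * iteration, each feeding its own accumulator.
 */
uint64_t
popcount_unrolled_x2(const char *buf, size_t bytes)
{
    const size_t step = sizeof(uint64_t);
    /* Largest multiple of two words that fits in the buffer. */
    size_t      loop_bytes = bytes - bytes % (2 * step);
    uint64_t    accum1 = 0;
    uint64_t    accum2 = 0;
    size_t      i = 0;

    /* Process 2 words per iteration with independent accumulators. */
    for (; i < loop_bytes; i += 2 * step)
    {
        accum1 += (uint64_t) __builtin_popcountll(load_u64(buf + i));
        accum2 += (uint64_t) __builtin_popcountll(load_u64(buf + i + step));
    }

    /* Process any remaining bytes one at a time. */
    for (; i < bytes; i++)
        accum1 += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

    return accum1 + accum2;
}
```

The accumulators are only combined once, after the loops, which mirrors how
the separate accum vectors in the SVE loop would be summed at the end.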
-Chiranmoy