Re: [PATCH] SVE popcount support

From: "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Malladi, Rama" <ramamalladi(at)hotmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>, Salvatore Dipietro <dipiets(at)amazon(dot)com>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>
Subject: Re: [PATCH] SVE popcount support
Date: 2025-02-06 08:44:35
Message-ID: TY2PR01MB26673A2C028501C981E84CD697F62@TY2PR01MB2667.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?

It seems that the double-load overhead from unaligned memory access isn't
too taxing, even on larger inputs. We can remove the alignment section to
simplify the code.
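
For readers following along, here is a rough scalar sketch of what a
dedicated alignment section typically looks like. It is not the code from
the patch: the function name popcount_with_alignment_prologue is
hypothetical, it uses the GCC/Clang popcount builtins instead of SVE
intrinsics, and it assumes 8-byte words. Dropping the alignment section
amounts to deleting the first loop and letting the main loop start at the
original pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration only (not the patch's code): count set bits
 * in buf, first consuming single bytes until buf is 8-byte aligned.
 */
uint64_t
popcount_with_alignment_prologue(const char *buf, size_t bytes)
{
    uint64_t    count = 0;

    assert(buf != NULL || bytes == 0);

    /* Alignment section: byte-at-a-time until the pointer is aligned. */
    while (bytes > 0 && ((uintptr_t) buf % sizeof(uint64_t)) != 0)
    {
        count += (uint64_t) __builtin_popcount((unsigned char) *buf);
        buf++;
        bytes--;
    }

    /* Main loop: one 8-byte word per iteration. */
    while (bytes >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        bytes -= sizeof(word);
    }

    /* Tail: remaining bytes that do not fill a whole word. */
    while (bytes > 0)
    {
        count += (uint64_t) __builtin_popcount((unsigned char) *buf);
        buf++;
        bytes--;
    }

    return count;
}
```

The result is the same with or without the first loop; only the addresses
used by the main loop change.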

> Does this hand-rolled loop unrolling offer any particular advantage? What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?

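The version unrolled to two vectors per iteration is faster than the
non-unrolled one (nearly twice as fast for buf sizes of 1024 and above),
but processing four vectors per iteration provides no additional benefit.
The numbers and the four-vector loop used are given below.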

 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors per iteration, each with its own accumulator */
for (; i < loop_bytes; i += vec_len * 4)
{
      vec64_1 = svld1(pred, (const uint64 *) (buf + i));
      accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
      vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
      accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));

      vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
      accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
      vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
      accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}
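
For comparison, a minimal portable sketch of the two-accumulator idea
behind the "Unrolled x2" column follows. This is an assumption-laden
analogue and not the patch itself: it works on 8-byte words with the
GCC/Clang builtin __builtin_popcountll rather than on SVE vectors, and the
names popcount_word and popcount_unrolled_x2 are made up for illustration.
The point it shows is that the two additions per iteration do not depend
on each other, so they can overlap instead of forming one serial chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: popcount of one 64-bit word (GCC/Clang builtin). */
static inline uint64_t
popcount_word(uint64_t w)
{
    return (uint64_t) __builtin_popcountll(w);
}

/*
 * Hypothetical scalar analogue of a 2x-unrolled popcount loop: two words
 * per iteration, each feeding its own accumulator.
 */
uint64_t
popcount_unrolled_x2(const char *buf, size_t bytes)
{
    const size_t step = sizeof(uint64_t);
    size_t      loop_bytes = bytes - bytes % (2 * step);
    size_t      i = 0;
    uint64_t    accum1 = 0;
    uint64_t    accum2 = 0;

    assert(buf != NULL || bytes == 0);

    /* Process 2 words per iteration; memcpy tolerates unaligned buf. */
    for (; i < loop_bytes; i += 2 * step)
    {
        uint64_t    w1;
        uint64_t    w2;

        memcpy(&w1, buf + i, step);
        memcpy(&w2, buf + i + step, step);
        accum1 += popcount_word(w1);
        accum2 += popcount_word(w2);
    }

    /* Tail: bytes that do not fill a whole pair of words. */
    for (; i < bytes; i++)
        accum1 += popcount_word((unsigned char) buf[i]);

    return accum1 + accum2;
}
```

Going from two accumulators to four would follow the same pattern with w3
and w4; the table above suggests that extra step does not pay off in the
SVE case.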

-Chiranmoy
