Re: define pg_structiszero(addr, s, r)

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: define pg_structiszero(addr, s, r)
Date: 2024-11-06 11:38:30
Message-ID: ZytVNvIbc1vc0qZz@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > Another option could be to use SIMD instructions to check multiple bytes
> > is zero in a single operation. Maybe just an idea to keep in mind and experiment
> > if we feel the need later on.
>
> Could do. I just wrote it that way to give the compiler flexibility to
> do SIMD implicitly.

ohhh, great, thanks!

> That seemed easier than messing around with SIMD
> intrinsics.

I actually had SIMD intrinsics in mind when I posted the SIMD idea, but...

> I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.

that's a better idea! Yeah, I think that with the bitwise OR the compiler can now
see that all the comparisons in one loop iteration are independent, so they can be
done in parallel and combined with a single OR operation (which makes the loop a
good candidate for SIMD optimization).
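
To illustrate the point, here is a minimal sketch of that kind of loop. It is
not the code from the attached patch: the function name, the choice of 8 words
per iteration, and the use of memcpy() to load the words are assumptions made
for this example only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the patch's function): returns true if the
 * "len" bytes starting at "ptr" are all zero.
 */
static bool
all_zeros_sketch(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    /* Main loop: examine 8 size_t words per iteration. */
    while ((size_t) (end - p) >= sizeof(size_t) * 8)
    {
        size_t      w[8];

        /* memcpy() sidesteps alignment and aliasing concerns. */
        memcpy(w, p, sizeof(w));

        /*
         * Bitwise OR, not ||: there is no short-circuiting, so all eight
         * words are read unconditionally and the compiler is free to do
         * the comparisons in parallel.
         */
        if ((w[0] != 0) | (w[1] != 0) | (w[2] != 0) | (w[3] != 0) |
            (w[4] != 0) | (w[5] != 0) | (w[6] != 0) | (w[7] != 0))
            return false;

        p += sizeof(w);
    }

    /* Tail: fewer than 8 words left, check byte by byte. */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}
```

Whether SIMD instructions are actually emitted for such a loop depends on the
compiler version and flags, as the numbers below suggest.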

> If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
> use SIMD operators. Clang uses some 128-bit registers without
> specifying -march:
>
> drowley(at)amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
> for i in {1..3}; do ./allzeros; done
> char: done in 1940539 nanoseconds
> size_t: done in 261731 nanoseconds (7.41425 times faster than char)
> size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
> size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
> char: done in 3030132 nanoseconds
> size_t: done in 477044 nanoseconds (6.35189 times faster than char)
> size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
> size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
> char: done in 3214037 nanoseconds
> size_t: done in 256901 nanoseconds (12.5108 times faster than char)
> size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
> size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
>

Thanks for the tests! Out of curiosity, I tried with gcc 11.4.0 (which does not
generate SIMD instructions here) and got:

$ gcc -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2655385 nanoseconds
size_t: done in 476021 nanoseconds (5.57829 times faster than char)
size_t SIMD DAVID: done in 174816 nanoseconds (15.1896 times faster than char)

or

$ gcc -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2681146 nanoseconds
size_t: done in 395041 nanoseconds (6.78701 times faster than char)
size_t SIMD DAVID: done in 175608 nanoseconds (15.2678 times faster than char)

=> Even without SIMD instructions being generated, it's faster than the size_t one.

But of course, it's even faster with SIMD:

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5318674 nanoseconds
size_t: done in 443591 nanoseconds (11.99 times faster than char)
size_t SIMD DAVID: done in 179650 nanoseconds (29.6058 times faster than char)

or

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5319534 nanoseconds
size_t: done in 426599 nanoseconds (12.4696 times faster than char)
size_t SIMD DAVID: done in 128687 nanoseconds (41.337 times faster than char)
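
For anyone reading these numbers: the figure in parentheses is the char time
divided by the time of the variant on that line. A minimal sketch of that
arithmetic follows; the helper names and the "%g" format are assumptions for
illustration, not taken from the test program itself.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: how many times faster a variant is than the baseline. */
static double
speedup_vs_baseline(long long baseline_ns, long long variant_ns)
{
    return (double) baseline_ns / (double) variant_ns;
}

/*
 * Hypothetical helper: build one result line in the same shape as the
 * output above. "%g" (6 significant digits) is an assumption.
 */
static int
format_result(char *buf, size_t buflen, const char *label,
              long long variant_ns, long long baseline_ns)
{
    return snprintf(buf, buflen,
                    "%s: done in %lld nanoseconds (%g times faster than char)",
                    label, variant_ns,
                    speedup_vs_baseline(baseline_ns, variant_ns));
}
```

Note that the char baseline differs a lot between the two gcc versions above,
so the ratios are best compared within a single run.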

So, I don't see any reason not to use this SIMD-friendly approach: please find v7
attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment: v7-0001-Optimize-pg_memory_is_all_zeros.patch (text/x-diff, 3.7 KB)
