Re: AIO v2.0

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2025-01-09 20:53:20
Message-ID: 6y5xyw3q2773mvvsjgap27js3guklxxgjy5o24f67vkkjliubv@pio54caabde2
Lists: pgsql-hackers

Hi,

On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.

> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
> -I$(pg_config --includedir-server) \
> bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache
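
(For readers without the attachment: bench-checksums.c isn't reproduced here,
but a loop of this rough shape - fill an array of pages, call
pg_checksum_page() over them, report elapsed time and bandwidth - is all
that's needed. The sketch below is illustrative only, not Ants' actual code;
the allocation and timing details are made up.)

/*
 * Illustrative sketch only - not the attached bench-checksums.c.
 * argv[1] = number of checksum calls, argv[2] = number of pages to cycle
 * through (small => stays in cache, large => streams from memory).
 *
 * Build roughly like: gcc -O2 -funroll-loops -ftree-vectorize -march=native
 *   -I$(pg_config --includedir-server) bench.c -o bench
 */
#include "postgres_fe.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include "storage/checksum_impl.h"

int
main(int argc, char **argv)
{
	long		ncalls = atol(argv[1]);
	long		npages = atol(argv[2]);
	char	   *pages = malloc(npages * BLCKSZ);
	uint64		sum = 0;
	struct timespec start, end;
	double		ms;

	/* dummy page contents; pd_upper ends up non-zero, which is all we need */
	memset(pages, 'x', npages * BLCKSZ);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ncalls; i++)
		sum += pg_checksum_page(pages + (i % npages) * BLCKSZ, (BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &end);

	ms = (end.tv_sec - start.tv_sec) * 1000.0 +
		(end.tv_nsec - start.tv_nsec) / 1000000.0;

	/* print the accumulated sum so the checksum loop can't be optimized away */
	printf("%.5fms @ %.3f GB/s (%llu)\n",
		   ms,
		   (double) ncalls * BLCKSZ / (ms / 1000.0) / 1e9,
		   (unsigned long long) sum);
	return 0;
}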

Rerunning that across the -march baselines here:

printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
      -I ~/src/postgresql/src/include/ -I src/include/ \
      /tmp/bench-checksums.c -o bench-checksums-native && \
      numactl --physcpubind 1 --membind 0 ./bench-checksums-native 1000000 $mem
  done
done

Workstation w/ 2x Xeon Gold 6442Y:

march mem result
x86-64 100 731.87779ms @ 11.193 GB/s
x86-64-v2 100 327.18580ms @ 25.038 GB/s
x86-64-v3 100 264.03547ms @ 31.026 GB/s
x86-64-v4 100 282.08065ms @ 29.041 GB/s
native 100 246.13766ms @ 33.282 GB/s
x86-64 100000 842.66827ms @ 9.722 GB/s
x86-64-v2 100000 604.52959ms @ 13.551 GB/s
x86-64-v3 100000 477.16239ms @ 17.168 GB/s
x86-64-v4 100000 476.07039ms @ 17.208 GB/s
native 100000 456.08080ms @ 17.962 GB/s
x86-64 1000000 845.51132ms @ 9.689 GB/s
x86-64-v2 1000000 612.07973ms @ 13.384 GB/s
x86-64-v3 1000000 485.23738ms @ 16.882 GB/s
x86-64-v4 1000000 483.86411ms @ 16.930 GB/s
native 1000000 462.88461ms @ 17.698 GB/s

Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march mem result
x86-64 100 417.19762ms @ 19.636 GB/s
x86-64-v2 100 130.67596ms @ 62.689 GB/s
x86-64-v3 100 97.07758ms @ 84.386 GB/s
x86-64-v4 100 95.67704ms @ 85.621 GB/s
native 100 95.15734ms @ 86.089 GB/s
x86-64 100000 431.38370ms @ 18.990 GB/s
x86-64-v2 100000 215.74856ms @ 37.970 GB/s
x86-64-v3 100000 199.74492ms @ 41.012 GB/s
x86-64-v4 100000 186.98300ms @ 43.811 GB/s
native 100000 187.68125ms @ 43.648 GB/s
x86-64 1000000 433.87893ms @ 18.881 GB/s
x86-64-v2 1000000 217.46561ms @ 37.670 GB/s
x86-64-v3 1000000 200.40667ms @ 40.877 GB/s
x86-64-v4 1000000 187.51978ms @ 43.686 GB/s
native 1000000 190.29273ms @ 43.049 GB/s

Workstation w/ 2x Xeon Gold 5215:
march mem result
x86-64 100 780.38881ms @ 10.497 GB/s
x86-64-v2 100 389.62005ms @ 21.026 GB/s
x86-64-v3 100 323.97294ms @ 25.286 GB/s
x86-64-v4 100 274.19493ms @ 29.877 GB/s
native 100 283.48674ms @ 28.897 GB/s
x86-64 100000 1112.63898ms @ 7.363 GB/s
x86-64-v2 100000 831.45641ms @ 9.853 GB/s
x86-64-v3 100000 696.20789ms @ 11.767 GB/s
x86-64-v4 100000 685.61636ms @ 11.948 GB/s
native 100000 689.78023ms @ 11.876 GB/s
x86-64 1000000 1128.65580ms @ 7.258 GB/s
x86-64-v2 1000000 843.92594ms @ 9.707 GB/s
x86-64-v3 1000000 718.78848ms @ 11.397 GB/s
x86-64-v4 1000000 687.68258ms @ 11.912 GB/s
native 1000000 705.34731ms @ 11.614 GB/s

That's quite a drastic difference between AMD and Intel. Of course, it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.

The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some CPU-capability-based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
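
To illustrate what I mean by dispatch - a very rough sketch, not actual code;
the _avx2/_avx512 variants are hypothetical names for the existing
implementation built with the respective target flags, and the x86-only GCC
builtins would of course need configure/meson support:

#include "postgres.h"

#include "storage/checksum.h"

typedef uint16 (*pg_checksum_page_fn) (char *page, BlockNumber blkno);

/* hypothetical variants, each compiled with a different -march/target */
extern uint16 pg_checksum_page_generic(char *page, BlockNumber blkno);
extern uint16 pg_checksum_page_avx2(char *page, BlockNumber blkno);
extern uint16 pg_checksum_page_avx512(char *page, BlockNumber blkno);

static pg_checksum_page_fn pg_checksum_page_impl = pg_checksum_page_generic;

/* called once at startup to pick the fastest implementation the CPU supports */
void
pg_checksum_choose_impl(void)
{
	__builtin_cpu_init();

	if (__builtin_cpu_supports("avx512bw"))
		pg_checksum_page_impl = pg_checksum_page_avx512;
	else if (__builtin_cpu_supports("avx2"))
		pg_checksum_page_impl = pg_checksum_page_avx2;
	/* else stay on the baseline implementation */
}

uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
	return pg_checksum_page_impl(page, blkno);
}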

I just realized that

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
matter in my numbers, though, because I was building with -O3 and
-march=native.

This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade, both
of which include checksum_impl.h directly.

This probably should be fixed too - perhaps by building the relevant code
once as part of fe_utils or such?

It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize -ftree-slp-vectorize. But loop unrolling isn't
enabled.

I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by -O3
is -fpeel-loops.

Here's a comparison of different flags, run on the 6442Y:

printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march \
        -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native && \
        numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem
    done
  done
done

march flags mem result
x86-64 -O2 100 2280.86253ms @ 10.775 GB/s
x86-64 -O2 -funroll-loops 100 2195.66942ms @ 11.193 GB/s
x86-64 -O3 100 2422.57588ms @ 10.145 GB/s
x86-64 -O3 -funroll-loops 100 2243.75826ms @ 10.953 GB/s
x86-64-v2 -O2 100 1243.68063ms @ 19.761 GB/s
x86-64-v2 -O2 -funroll-loops 100 979.67783ms @ 25.086 GB/s
x86-64-v2 -O3 100 988.80296ms @ 24.854 GB/s
x86-64-v2 -O3 -funroll-loops 100 991.31632ms @ 24.791 GB/s
x86-64-v3 -O2 100 1146.90165ms @ 21.428 GB/s
x86-64-v3 -O2 -funroll-loops 100 785.81395ms @ 31.275 GB/s
x86-64-v3 -O3 100 800.53627ms @ 30.699 GB/s
x86-64-v3 -O3 -funroll-loops 100 790.21230ms @ 31.101 GB/s
x86-64-v4 -O2 100 883.82916ms @ 27.806 GB/s
x86-64-v4 -O2 -funroll-loops 100 831.55372ms @ 29.554 GB/s
x86-64-v4 -O3 100 843.23141ms @ 29.145 GB/s
x86-64-v4 -O3 -funroll-loops 100 821.19969ms @ 29.927 GB/s
native -O2 100 1197.41357ms @ 20.524 GB/s
native -O2 -funroll-loops 100 718.05253ms @ 34.226 GB/s
native -O3 100 747.94090ms @ 32.858 GB/s
native -O3 -funroll-loops 100 751.52379ms @ 32.702 GB/s
x86-64 -O2 100000 2911.47087ms @ 8.441 GB/s
x86-64 -O2 -funroll-loops 100000 2525.45504ms @ 9.731 GB/s
x86-64 -O3 100000 2497.42016ms @ 9.841 GB/s
x86-64 -O3 -funroll-loops 100000 2346.33551ms @ 10.474 GB/s
x86-64-v2 -O2 100000 2124.10102ms @ 11.570 GB/s
x86-64-v2 -O2 -funroll-loops 100000 1819.09659ms @ 13.510 GB/s
x86-64-v2 -O3 100000 1613.45823ms @ 15.232 GB/s
x86-64-v2 -O3 -funroll-loops 100000 1607.09245ms @ 15.292 GB/s
x86-64-v3 -O2 100000 1972.89390ms @ 12.457 GB/s
x86-64-v3 -O2 -funroll-loops 100000 1432.58229ms @ 17.155 GB/s
x86-64-v3 -O3 100000 1533.18003ms @ 16.029 GB/s
x86-64-v3 -O3 -funroll-loops 100000 1539.39779ms @ 15.965 GB/s
x86-64-v4 -O2 100000 1591.96881ms @ 15.437 GB/s
x86-64-v4 -O2 -funroll-loops 100000 1434.91828ms @ 17.127 GB/s
x86-64-v4 -O3 100000 1454.30133ms @ 16.899 GB/s
x86-64-v4 -O3 -funroll-loops 100000 1429.13733ms @ 17.196 GB/s
native -O2 100000 1980.53734ms @ 12.409 GB/s
native -O2 -funroll-loops 100000 1373.95337ms @ 17.887 GB/s
native -O3 100000 1517.90164ms @ 16.191 GB/s
native -O3 -funroll-loops 100000 1508.37021ms @ 16.293 GB/s

> > > Is it just that the calculation is slow, or is it the fact that checksumming
> > > needs to bring the page into the CPU cache. Did you notice any hints which
> > > might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> > doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> > already
>
> I didn't yet check the code - when doing AIO completions, will checksumming
> be running on the same core as the one that is going to be using the page?

With io_uring normally yes, the exception being that another backend that
needs the same page could end up running the completion.

With worker mode normally no.

Greetings,

Andres Freund
