From:       Andres Freund <andres(at)anarazel(dot)de>
To:         Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
Cc:         Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject:    Re: AIO v2.0
Date:       2025-01-09 20:53:20
Message-ID: 6y5xyw3q2773mvvsjgap27js3guklxxgjy5o24f67vkkjliubv@pio54caabde2
Lists:      pgsql-hackers
Hi,
On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.
> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
> -I$(pg_config --includedir-server) \
> bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache
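For anyone following along: the multiply in question is the FNV_PRIME one in
the checksum's per-element step, which checksum_impl.h defines as

#define CHECKSUM_COMP(checksum, value) \
do { \
    uint32 __tmp = (checksum) ^ (value); \
    (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

and that's what the vectorizer lowers to vpmulld.

Here is also a rough sketch of the shape such a harness takes - this is not
the attached bench-checksums.c, just an illustration assuming
pg_checksum_page() from checksum_impl.h, with the first argument being the
number of checksums and the second the array size in pages:

/*
 * Sketch of a bench-checksums.c-style harness; compile with
 * -I$(pg_config --includedir-server), as in the invocation above.
 */
#include "postgres_fe.h"
#include "storage/checksum_impl.h"

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
    long        ncksums = atol(argv[1]);
    long        npages = atol(argv[2]);
    char       *pages = aligned_alloc(BLCKSZ, (size_t) npages * BLCKSZ);
    volatile uint16 sink = 0;
    struct timespec t0, t1;
    double      ms;

    if (pages == NULL)
        return 1;

    /* fill with non-zero data so we aren't checksumming all-zero pages */
    for (long i = 0; i < npages * BLCKSZ; i++)
        pages[i] = (char) i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ncksums; i++)
        sink ^= pg_checksum_page(pages + (i % npages) * BLCKSZ,
                                 (BlockNumber) i);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%.5fms @ %.3f GB/s\n",
           ms, ncksums * (double) BLCKSZ / (ms / 1000.0) / 1e9);

    (void) sink;                /* volatile sink keeps the work alive */
    return 0;
}

With that out of the way, here's the driver loop I ran across the -march
levels: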
printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
    for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
        printf "%16s\t%16s\t" $march $mem
        gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
            -I ~/src/postgresql/src/include/ -I src/include/ \
            /tmp/bench-checksums.c -o bench-checksums-native &&
            numactl --physcpubind 1 --membind 0 \
                ./bench-checksums-native 1000000 $mem
    done
done
Workstation w/ 2x Xeon Gold 6442Y:
march        mem       result
x86-64       100       731.87779ms @ 11.193 GB/s
x86-64-v2    100       327.18580ms @ 25.038 GB/s
x86-64-v3    100       264.03547ms @ 31.026 GB/s
x86-64-v4    100       282.08065ms @ 29.041 GB/s
native       100       246.13766ms @ 33.282 GB/s
x86-64       100000    842.66827ms @ 9.722 GB/s
x86-64-v2    100000    604.52959ms @ 13.551 GB/s
x86-64-v3    100000    477.16239ms @ 17.168 GB/s
x86-64-v4    100000    476.07039ms @ 17.208 GB/s
native       100000    456.08080ms @ 17.962 GB/s
x86-64       1000000   845.51132ms @ 9.689 GB/s
x86-64-v2    1000000   612.07973ms @ 13.384 GB/s
x86-64-v3    1000000   485.23738ms @ 16.882 GB/s
x86-64-v4    1000000   483.86411ms @ 16.930 GB/s
native       1000000   462.88461ms @ 17.698 GB/s
Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march        mem       result
x86-64       100       417.19762ms @ 19.636 GB/s
x86-64-v2    100       130.67596ms @ 62.689 GB/s
x86-64-v3    100       97.07758ms @ 84.386 GB/s
x86-64-v4    100       95.67704ms @ 85.621 GB/s
native       100       95.15734ms @ 86.089 GB/s
x86-64       100000    431.38370ms @ 18.990 GB/s
x86-64-v2    100000    215.74856ms @ 37.970 GB/s
x86-64-v3    100000    199.74492ms @ 41.012 GB/s
x86-64-v4    100000    186.98300ms @ 43.811 GB/s
native       100000    187.68125ms @ 43.648 GB/s
x86-64       1000000   433.87893ms @ 18.881 GB/s
x86-64-v2    1000000   217.46561ms @ 37.670 GB/s
x86-64-v3    1000000   200.40667ms @ 40.877 GB/s
x86-64-v4    1000000   187.51978ms @ 43.686 GB/s
native       1000000   190.29273ms @ 43.049 GB/s
Workstation w/ 2x Xeon Gold 5215:
march        mem       result
x86-64       100       780.38881ms @ 10.497 GB/s
x86-64-v2    100       389.62005ms @ 21.026 GB/s
x86-64-v3    100       323.97294ms @ 25.286 GB/s
x86-64-v4    100       274.19493ms @ 29.877 GB/s
native       100       283.48674ms @ 28.897 GB/s
x86-64       100000    1112.63898ms @ 7.363 GB/s
x86-64-v2    100000    831.45641ms @ 9.853 GB/s
x86-64-v3    100000    696.20789ms @ 11.767 GB/s
x86-64-v4    100000    685.61636ms @ 11.948 GB/s
native       100000    689.78023ms @ 11.876 GB/s
x86-64       1000000   1128.65580ms @ 7.258 GB/s
x86-64-v2    1000000   843.92594ms @ 9.707 GB/s
x86-64-v3    1000000   718.78848ms @ 11.397 GB/s
x86-64-v4    1000000   687.68258ms @ 11.912 GB/s
native       1000000   705.34731ms @ 11.614 GB/s
That's quite a drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.

The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some CPU-capability based dispatch would
likely be worth it, even though it didn't matter in my case, since I was
building with -march=native.
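To be concrete about what I mean by dispatch - a hypothetical sketch using
GCC/clang function multi-versioning (this is not what the tree does today,
and the loop is a simplified stand-in for the real FNV checksum, omitting
the shift-mix step):

#include <stddef.h>
#include <stdint.h>

#define N_SUMS 32       /* parallel accumulators, as in checksum_impl.h */

/*
 * The compiler emits one clone of the function per listed target and an
 * ifunc resolver picks the best one for the CPU at load time (GCC/clang
 * on glibc).  Assumes nwords is a multiple of N_SUMS.
 */
__attribute__((target_clones("default,sse4.1,avx2,avx512f")))
uint32_t
checksum_block_demo(const uint32_t *data, size_t nwords)
{
    uint32_t    sums[N_SUMS];
    uint32_t    result = 0;

    for (int j = 0; j < N_SUMS; j++)
        sums[j] = 0;

    /* independent accumulators let each clone auto-vectorize the loop */
    for (size_t i = 0; i < nwords; i += N_SUMS)
        for (int j = 0; j < N_SUMS; j++)
            sums[j] = (sums[j] ^ data[i + j]) * 16777619u;

    for (int j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}

That would get most of the benefit of a -march=x86-64-v3/v4 build without
having to compile and ship separate objects per target.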
I just realized that:

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
   matter in my numbers though, because I was building with -O3 and
   -march=native.

   This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade,
   both of which include checksum_impl.h directly.

   This probably should be fixed too - perhaps by building the relevant code
   once as part of fe_utils or such?
It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize -ftree-slp-vectorize. But loop unrolling isn't
enabled.
I do see a perf difference at -O2 between using and not using -funroll-loops.
Interestingly, not at -O3, despite -funroll-loops not actually being enabled
by -O3. I think the relevant option that *is* turned on by -O3 is
-fpeel-loops.
Here's a comparison of different flags, run on the 6442Y:
printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
    for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
        for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
            printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
            gcc $flags -march=$march \
                -I ~/src/postgresql/src/include/ -I src/include/ \
                /tmp/bench-checksums.c -o bench-checksums-native &&
                numactl --physcpubind 3 --membind 0 \
                    ./bench-checksums-native 3000000 $mem
        done
    done
done
march        flags                  mem       result
x86-64       -O2                    100       2280.86253ms @ 10.775 GB/s
x86-64       -O2 -funroll-loops     100       2195.66942ms @ 11.193 GB/s
x86-64       -O3                    100       2422.57588ms @ 10.145 GB/s
x86-64       -O3 -funroll-loops     100       2243.75826ms @ 10.953 GB/s
x86-64-v2    -O2                    100       1243.68063ms @ 19.761 GB/s
x86-64-v2    -O2 -funroll-loops     100       979.67783ms @ 25.086 GB/s
x86-64-v2    -O3                    100       988.80296ms @ 24.854 GB/s
x86-64-v2    -O3 -funroll-loops     100       991.31632ms @ 24.791 GB/s
x86-64-v3    -O2                    100       1146.90165ms @ 21.428 GB/s
x86-64-v3    -O2 -funroll-loops     100       785.81395ms @ 31.275 GB/s
x86-64-v3    -O3                    100       800.53627ms @ 30.699 GB/s
x86-64-v3    -O3 -funroll-loops     100       790.21230ms @ 31.101 GB/s
x86-64-v4    -O2                    100       883.82916ms @ 27.806 GB/s
x86-64-v4    -O2 -funroll-loops     100       831.55372ms @ 29.554 GB/s
x86-64-v4    -O3                    100       843.23141ms @ 29.145 GB/s
x86-64-v4    -O3 -funroll-loops     100       821.19969ms @ 29.927 GB/s
native       -O2                    100       1197.41357ms @ 20.524 GB/s
native       -O2 -funroll-loops     100       718.05253ms @ 34.226 GB/s
native       -O3                    100       747.94090ms @ 32.858 GB/s
native       -O3 -funroll-loops     100       751.52379ms @ 32.702 GB/s
x86-64       -O2                    100000    2911.47087ms @ 8.441 GB/s
x86-64       -O2 -funroll-loops     100000    2525.45504ms @ 9.731 GB/s
x86-64       -O3                    100000    2497.42016ms @ 9.841 GB/s
x86-64       -O3 -funroll-loops     100000    2346.33551ms @ 10.474 GB/s
x86-64-v2    -O2                    100000    2124.10102ms @ 11.570 GB/s
x86-64-v2    -O2 -funroll-loops     100000    1819.09659ms @ 13.510 GB/s
x86-64-v2    -O3                    100000    1613.45823ms @ 15.232 GB/s
x86-64-v2    -O3 -funroll-loops     100000    1607.09245ms @ 15.292 GB/s
x86-64-v3    -O2                    100000    1972.89390ms @ 12.457 GB/s
x86-64-v3    -O2 -funroll-loops     100000    1432.58229ms @ 17.155 GB/s
x86-64-v3    -O3                    100000    1533.18003ms @ 16.029 GB/s
x86-64-v3    -O3 -funroll-loops     100000    1539.39779ms @ 15.965 GB/s
x86-64-v4    -O2                    100000    1591.96881ms @ 15.437 GB/s
x86-64-v4    -O2 -funroll-loops     100000    1434.91828ms @ 17.127 GB/s
x86-64-v4    -O3                    100000    1454.30133ms @ 16.899 GB/s
x86-64-v4    -O3 -funroll-loops     100000    1429.13733ms @ 17.196 GB/s
native       -O2                    100000    1980.53734ms @ 12.409 GB/s
native       -O2 -funroll-loops     100000    1373.95337ms @ 17.887 GB/s
native       -O3                    100000    1517.90164ms @ 16.191 GB/s
native       -O3 -funroll-loops     100000    1508.37021ms @ 16.293 GB/s
> > > Is it just that the calculation is slow, or is it the fact that checksumming
> > > needs to bring the page into the CPU cache. Did you notice any hints which
> > > might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> > doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> > already
>
> I didn't check the code yet - when doing aio completions, will checksumming
> be running on the same core that is going to be using the page?
With io_uring, normally yes; the exception being that another backend that
needs the same page could end up running the completion.

With worker mode, normally no.
Greetings,
Andres Freund