From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2025-01-09 18:10:24
Message-ID: CANwKhkOEWn7pBXyg6TnoDwrOaT=vTa-cJn3mtuXZaTneeGLXPQ@mail.gmail.com
Lists: pgsql-hackers
On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I'm curious about this because the checksum code should be fast enough
> > to easily handle that throughput.
>
> It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> workstation. But we don't have a good ready-made way of testing that without
> also doing IO, so it's kinda hard to say.
Interesting. I wonder if it's related to Intel having increased vpmulld
latency to 10 cycles back in Haswell. The Zen 3 I'm testing on has a
latency of 3 cycles and twice the throughput.
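For anyone following along, the hot loop is essentially the following
(abridged from src/include/storage/checksum_impl.h and quoted from
memory, so check the real source). Each of the 32 partial sums carries
one multiply on its serial dependency chain, so successive outer-loop
iterations can only overlap through the independent sums, which is why
vpmulld latency can show up at all:

#define N_SUMS 32
#define FNV_PRIME 16777619

/* one mixing round: xor in the input, then multiply-and-shift */
#define CHECKSUM_COMP(checksum, value) \
do { \
	uint32 __tmp = (checksum) ^ (value); \
	(checksum) = __tmp * FNV_PRIME + (__tmp >> 17); \
} while (0)

/* main calculation: 32 interleaved sums, one column of the page each */
for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
	for (j = 0; j < N_SUMS; j++)
		CHECKSUM_COMP(sums[j], page->data[i][j]);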
Attached is a naive and crude benchmark that I used for testing here.
Compiled with:
gcc -O2 -funroll-loops -ftree-vectorize -march=native \
-I$(pg_config --includedir-server) \
bench-checksums.c -o bench-checksums-native
It just fills an array of pages and checksums them; the first argument
is the number of checksums to compute, the second is the array size in
pages. I used 1M checksums, with 100 pages for in-cache behavior and
100000 pages for in-memory performance.
869.85927ms @ 9.418 GB/s - generic from memory
772.12252ms @ 10.610 GB/s - generic in cache
442.61869ms @ 18.508 GB/s - native from memory
137.07573ms @ 59.763 GB/s - native in cache
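Since the attachment isn't inlined here, a harness of this kind might
look roughly like the sketch below (my reconstruction for illustration,
not necessarily what bench-checksums.c actually does):

/* bench.c: crude checksum micro-benchmark sketch.  Build with:
 *   gcc -O2 -funroll-loops -ftree-vectorize -march=native \
 *       -I$(pg_config --includedir-server) bench.c -o bench
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include "postgres.h"
#include "storage/checksum_impl.h"	/* inlines pg_checksum_page() */

int
main(int argc, char **argv)
{
	long		nsums = atol(argv[1]);	/* checksums to compute */
	long		npages = atol(argv[2]); /* working set, in pages */
	char	   *pages = aligned_alloc(BLCKSZ, (size_t) npages * BLCKSZ);
	volatile uint16 sink = 0;			/* keep the results live */
	struct timespec t0, t1;
	double		ms;

	/* non-zero fill so the pages don't look "new" to the checksum code */
	memset(pages, 'x', (size_t) npages * BLCKSZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < nsums; i++)
		sink = pg_checksum_page(pages + (i % npages) * BLCKSZ,
								(BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
	printf("%.5fms @ %.3f GB/s\n",
		   ms, nsums * (double) BLCKSZ / (ms * 1e6));
	(void) sink;
	return 0;
}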
> > Is it just that the calculation is slow, or is it the fact that checksumming
> > needs to bring the page into the CPU cache. Did you notice any hints which
> > might be the case?
>
> I don't think the issue is that checksumming pulls the data into CPU caches
>
> 1) This is visible with SELECT that actually uses the data
>
> 2) I added prefetching to avoid any meaningful amount of cache misses and it
> doesn't change the overall timing much
>
> 3) It's visible with buffered IO, which has pulled the data into CPU caches
> already
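(For concreteness, the prefetching in 2) would presumably be something
along these lines; my sketch of the idea, not Andres's actual change:)

/* while checksumming page i, start pulling in page i + 1 */
if (i + 1 < nblocks)
{
	const char *next = pages + (size_t) (i + 1) * BLCKSZ;

	for (int off = 0; off < BLCKSZ; off += 64)	/* 64 = cache line */
		__builtin_prefetch(next + off);
}
checksum = pg_checksum_page(pages + (size_t) i * BLCKSZ, i);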
I didn't check the code yet: when handling AIO completions, will the
checksumming run on the same core that is going to be using the page?
It could also be that the checksumming generates extra traffic on the
memory bus or the CPU's internal rings, which, with so much data
already in flight, causes contention.
> > I don't really have a machine at hand that can do anywhere close to this
> > amount of I/O.
>
> It's visible even when pulling from the page cache, if to a somewhat lesser
> degree.
Good point, I'll see if I can reproduce.
> I wonder if it's worth adding a test function that computes checksums of all
> shared buffers in memory already. That'd allow exercising the checksum code in
> a realistic context (i.e. buffer locking etc preventing some out-of-order
> effects, using 8kB chunks etc) without also needing to involve the IO path.
Out-of-order effects shouldn't matter that much; over here, even in the
best case, it still takes 500+ cycles per iteration.
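That said, a sketch of such a test function might look like the
following (untested, and the function name is mine; it skips the
per-buffer pinning and content locking a careful version would want,
and copies each page so pg_checksum_page's temporary header write
never touches shared memory):

#include "postgres.h"
#include "fmgr.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/checksum.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(checksum_all_buffers);

Datum
checksum_all_buffers(PG_FUNCTION_ARGS)
{
	uint64		result = 0;
	PGAlignedBlock copy;

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *hdr = GetBufferDescriptor(i);
		uint32		buf_state = LockBufHdr(hdr);

		if (buf_state & BM_VALID)
		{
			BlockNumber blkno = hdr->tag.blockNum;

			/* snapshot the page, then checksum outside the spinlock */
			memcpy(copy.data, BufferGetBlock(i + 1), BLCKSZ);
			UnlockBufHdr(hdr, buf_state);
			result += pg_checksum_page(copy.data, blkno);
		}
		else
			UnlockBufHdr(hdr, buf_state);
	}

	PG_RETURN_INT64((int64) result);
}

Exposed to SQL with something like CREATE FUNCTION
checksum_all_buffers() RETURNS bigint AS 'MODULE_PATHNAME' LANGUAGE C
STRICT; the summed result is only there to keep the work from being
optimized away.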
> > I'm asking because if it's the calculation that is slow then it seems
> > like it's time to compile different ISA extension variants of the
> > checksum code and select the best one at runtime.
>
> You think it's ISA specific? I don't see a significant effect of compiling
> with -march=native or not - and that should suffice to make the checksum code
> built with sufficiently high ISA support, right?
Right, the disassembly below looked very good.
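For what it's worth, the runtime selection I had in mind can be as
simple as compiler function multi-versioning; a minimal illustration
(my example, not something proposed in this thread):

#include <stddef.h>
#include <stdint.h>

#define FNV_PRIME 16777619

/* GCC/Clang emit one body per listed target and pick the best one via
 * ifunc at load time; the math is the checksum's mixing step,
 * serialized here just to keep the example short. */
__attribute__((target_clones("default,avx2,avx512f")))
uint32_t
mix_block(const uint32_t *data, size_t n, uint32_t sum)
{
	for (size_t i = 0; i < n; i++)
	{
		uint32_t tmp = sum ^ data[i];

		sum = tmp * FNV_PRIME + (tmp >> 17);
	}
	return sum;
}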
> FWIW CPU profiles show all the time being spent in the "main checksum
> calculation" loop:
.. disassembly omitted for brevity
I'm not sure whether this applies here, given the microarchitectural
differences. But in my case, when bound by memory bandwidth, the
main-loop profile events were clustered around a few instructions, as
they are here, whereas when running from cache all instructions were
about equally represented.
> I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
> seems to be about the same as 32.
This suggests that mulld latency is not the culprit: if latency were
the limit, going from 32 to 64 independent sums would have given the
out-of-order engine twice the parallelism and should have helped, and
it didn't.
Regards,
Ants
Attachment: bench-checksums.c (text/x-csrc, 954 bytes)