From: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com> |
Subject: | Re: AIO v2.0 |
Date: | 2025-01-10 10:33:39 |
Message-ID: | CANwKhkMmcH9RMqueX0jhNXgCpfSraovN9AzaEY-q6O9796+hKQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, 9 Jan 2025 at 22:53, Andres Freund <andres(at)anarazel(dot)de> wrote:
<Edited to highlight interesting numbers>
> Workstation w/ 2x Xeon Gold 6442Y:
>
> march mem result
> native 100 246.13766ms @ 33.282 GB/s
> native 100000 456.08080ms @ 17.962 GB/s
>
> Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
> march mem result
> native 100 95.15734ms @ 86.089 GB/s
> native 100000 187.68125ms @ 43.648 GB/s
>
> Workstation w/ 2x Xeon Gold 5215:
> march mem result
> native 100 283.48674ms @ 28.897 GB/s
> native 100000 689.78023ms @ 11.876 GB/s
>
> That's quite the drastic difference between amd and intel. Of course it's also
> comparing a multi-core server uarch (lower per-core bandwidth, much higher
> aggregate bandwidth) with a client uarch.
In hindsight, building the hash around the mulld primitive was a bad decision,
because Intel for whatever reason decided to kill its performance:
vpmulld        latency (cycles)   throughput (values/cycle)
Sandy Bridge    5                  4
Alder Lake     10                  8
Zen 4           3                 16
Zen 5           3                 32
Most top performing hashes these days seem to be built around AES
instructions.
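
For readers unfamiliar with the hash in question, the following sketch shows why such a design leans on vpmulld: every 32-bit input word goes through one multiply, and the independent lanes are what a compiler can turn into packed multiplies. It is loosely modeled on the FNV-1a-style mixing step of PostgreSQL's page checksum. The names lane_hash, mix and N_LANES, the seeds, and the final fold are illustrative assumptions, so it will not reproduce real page checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_LANES   32            /* independent partial sums (assumed lane count) */
#define FNV_PRIME 16777619u

/* One mixing round: xor in the value, multiply, fold the high bits back in. */
static inline uint32_t
mix(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;

    return (tmp * FNV_PRIME) ^ (tmp >> 17);
}

/* Hash nwords 32-bit words; nwords must be a multiple of N_LANES. */
static uint32_t
lane_hash(const uint32_t *words, size_t nwords)
{
    uint32_t sums[N_LANES];
    uint32_t result = 0;

    assert(nwords % N_LANES == 0);

    /* Illustrative seeds only; a real implementation picks its own. */
    for (int j = 0; j < N_LANES; j++)
        sums[j] = FNV_PRIME + (uint32_t) j;

    /* The inner loop has no cross-lane dependency, so it can map onto
     * packed 32-bit multiplies (vpmulld) when vectorized. */
    for (size_t i = 0; i < nwords; i += N_LANES)
        for (int j = 0; j < N_LANES; j++)
            sums[j] = mix(sums[j], words[i + j]);

    /* Simplified final fold: xor all lanes together. */
    for (int j = 0; j < N_LANES; j++)
        result ^= sums[j];

    return result;
}
```

Since each word costs exactly one 32-bit multiply, the achievable rate of a loop like this is bounded by the vpmulld throughput figures in the table above.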
But I was curious why there is such a difference in streaming results.
Turns out that for whatever reason one core gets access to much less
bandwidth on Intel than on AMD. [1]
This made me take another look at your previous prewarm numbers. It looks
like performance is suspiciously proportional to the number of copies of
data the CPU has to make:
config                        checksums   time in ms   number of copies
buffered io_engine=io_uring   0           3883.127     2
buffered io_engine=io_uring   1           5880.892     3
direct io_engine=io_uring     0           2067.142     1
direct io_engine=io_uring     1           3835.968     2
To me that feels like there is a bandwidth bottleneck in this workload, and
checksumming the page when the contents are not looked at just adds to the
consumed bandwidth, bringing down the performance correspondingly.
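
The proportionality can be checked by dividing each prewarm time by its number of copies. The small program below does that with the four measurements from the table; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the prewarm table quoted above. */
struct prewarm_run
{
    const char *config;
    int         checksums;
    double      time_ms;
    int         copies;     /* copies of the data the CPU has to make */
};

static const struct prewarm_run runs[] = {
    {"buffered io_engine=io_uring", 0, 3883.127, 2},
    {"buffered io_engine=io_uring", 1, 5880.892, 3},
    {"direct io_engine=io_uring",   0, 2067.142, 1},
    {"direct io_engine=io_uring",   1, 3835.968, 2},
};

/* Time attributable to a single pass over the data. */
static double
ms_per_copy(const struct prewarm_run *run)
{
    assert(run->copies > 0);
    return run->time_ms / run->copies;
}

static void
print_ms_per_copy(void)
{
    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-28s checksums=%d  %.1f ms/copy\n",
               runs[i].config, runs[i].checksums, ms_per_copy(&runs[i]));
}
```

The four quotients land between roughly 1918 and 2067 ms per copy, a spread of under 8%, which is what one would expect if memory bandwidth dominates the runtime.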
This doesn't explain why you observed a slowdown in the select case, but I
think that might be due to the per-core bandwidth limitation. Both cases
might pull the same amount of data into the cache, but without checksums
it is spread out over a longer time, allowing other work to happen
concurrently.
[1] https://chipsandcheese.com/p/a-peek-at-sapphire-rapids#%C2%A7bandwidth
> The difference between the baseline CPU target and a more modern profile is
> also rather impressive. Looks like some cpu-capability based dispatch would
> likely be worth it, even if it didn't matter in my case due to -march=native.
Yes, along with using function attributes for the optimization flags to avoid
the build system hacks.
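
A minimal sketch of what that combination could look like with GCC or Clang on x86-64: the target function attribute enables AVX2 code generation for a single function without special per-file compiler flags, and __builtin_cpu_supports picks the variant at run time. The function names and the toy loop are assumptions for illustration, not code from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable variant, compiled with whatever flags the build uses. */
static uint32_t
sum_words_default(const uint32_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i] * 16777619u;
    return sum;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_AVX2_VARIANT 1

/* Same loop; the attribute lets the compiler use AVX2 in this function only. */
__attribute__((target("avx2")))
static uint32_t
sum_words_avx2(const uint32_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i] * 16777619u;
    return sum;
}
#endif

typedef uint32_t (*sum_words_fn) (const uint32_t *words, size_t n);

/* Pick the best variant the running CPU supports. */
static sum_words_fn
choose_sum_words(void)
{
#ifdef HAVE_AVX2_VARIANT
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_words_avx2;
#endif
    return sum_words_default;
}

/* Public entry point: resolve the variant once, then call through it. */
static uint32_t
sum_words(const uint32_t *words, size_t n)
{
    static sum_words_fn impl = NULL;

    assert(words != NULL || n == 0);
    if (impl == NULL)
        impl = choose_sum_words();
    return impl(words, n);
}
```

Both variants compute the same wrapping 32-bit result, so callers cannot tell which one was selected; only the generated code differs.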
--
Ants