Re: AIO v2.0

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2025-01-09 16:25:39
Message-ID: cidihin6txgswozfgrcs5jkzsqmrbkebhauyjjwr6uhtzqti7w@vqzav76usvmq
Lists: pgsql-hackers

Hi,

On 2025-01-09 10:59:22 +0200, Ants Aasma wrote:
> On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > master: ~18 GB/s
> > patch, buffered: ~20 GB/s
> > patch, direct, worker: ~28 GB/s
> > patch, direct, uring: ~35 GB/s
> >
> >
> > This was with io_workers=32, io_max_concurrency=128,
> > effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
> > still have the numbers for).
> >
> >
> > This was without data checksums enabled as otherwise the checksum code becomes
> > a *huge* bottleneck.
>
> I'm curious about this because the checksum code should be fast enough
> to easily handle that throughput.

It seems to top out at ~5-6 GB/s on my 2x Xeon Gold 6442Y workstation. But we
don't have a good ready-made way of testing checksum performance without also
doing IO, so it's kinda hard to say precisely.
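
Something like the following quick-and-dirty harness can measure that - a
sketch, not anything in the tree. It keeps a single page hot in L1, so it
measures pure compute, i.e. an upper bound compared to checksumming
freshly-read data. checksum_impl.h is explicitly meant to be includable by
frontend programs, so this builds standalone against a configured source tree:

/*
 * Sketch of a standalone pg_checksum_page() throughput test. Build against
 * a configured source tree, e.g.:
 *   cc -O2 -march=native -Isrc/include bench_checksum.c -o bench_checksum
 */
#include "postgres_fe.h"

#include <time.h>

#include "storage/checksum.h"
#include "storage/checksum_impl.h"	/* defines pg_checksum_page() */

int
main(void)
{
	static char page[BLCKSZ];
	uint64		npages = 4 * 1024 * 1024;	/* 32GB worth of 8kB pages */
	volatile uint16 sink;
	struct timespec start,
				end;
	double		secs;

	/* arbitrary non-new page contents; the speed is data-independent */
	for (int i = 0; i < BLCKSZ; i++)
		page[i] = (char) (i + 1);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (uint64 i = 0; i < npages; i++)
		sink = pg_checksum_page(page, (BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", npages * (double) BLCKSZ / secs / 1e9);
	(void) sink;
	return 0;
}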

> I remember checksum overhead being negligible even when pulling in pages
> from page cache.

It's indeed much less of an issue when pulling pages from the page cache, as
the copy out of the page cache is itself fairly slow and masks the checksum
cost. With direct IO, where that copy is no longer the main driver of CPU use,
the checksum overhead becomes much more visible.

Even with buffered IO it became a bigger issue in 17, due to io_combine_limit:
before reads were combined into larger syscalls, lots of tiny syscalls limited
the peak throughput that could reach the checksumming code in the first place.

I created a 21554MB relation and measured the time to do a pg_prewarm() of
that relation after evicting all of shared buffers (the first time buffers are
touched has somewhat different performance characteristics). I am using direct
IO and io_uring here, as buffered IO would include the page cache copy cost,
and worker mode could parallelize the checksum computation on reads. The
checksum cost is a bigger issue for writes than for reads, but it's harder to
quickly generate enough dirty data for a repeatable benchmark.

This system can do about 12.5GB/s of read IO, so reading the 21554MB relation
can't take much less than ~1.7s.

Just to show the effect of the read size on page cache copy performance:

config                                       checksums  time in ms
buffered io_engine=sync io_combine_limit=1   0          6712.153
buffered io_engine=sync io_combine_limit=2   0          5919.215
buffered io_engine=sync io_combine_limit=4   0          5738.496
buffered io_engine=sync io_combine_limit=8   0          5396.415
buffered io_engine=sync io_combine_limit=16  0          5312.803
buffered io_engine=sync io_combine_limit=32  0          5275.389

To see the effect of page cache copy overhead:

config                       checksums  time in ms
buffered io_engine=io_uring  0          3901.625
direct   io_engine=io_uring  0          2075.330

Now to show the effect of checksums (enabled/disabled with pg_checksums):

config                       checksums  time in ms
buffered io_engine=io_uring  0          3883.127
buffered io_engine=io_uring  1          5880.892
direct   io_engine=io_uring  0          2067.142
direct   io_engine=io_uring  1          3835.968

So with direct + uring w/o checksums, we can reach 10427 MB/s (21554MB /
2.067s, close-ish to disk speed), but with checksums we only reach 5620 MB/s.

> Is it just that the calculation is slow, or is it the fact that checksumming
> needs to bring the page into the CPU cache? Did you notice any hints as to
> which might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:

1) The slowdown is visible with a SELECT that actually uses the data

2) I added prefetching to avoid any meaningful number of cache misses
(sketched below) and it doesn't change the overall timing much

3) It's visible with buffered IO, which has pulled the data into CPU caches
already
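
To illustrate (2) - a sketch of the shape of that experiment, not necessarily
the exact change - the prefetch goes into pg_checksum_block()'s main loop:

	/*
	 * main checksum calculation, with an explicit prefetch a few cache
	 * lines ahead; each page->data[i] row is 32 * 4 = 128 bytes, so
	 * i + 4 is 512 bytes ahead (prefetching past the end of the page
	 * is harmless)
	 */
	for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
	{
		__builtin_prefetch(&page->data[i + 4][0]);
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], page->data[i][j]);
	}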

> I don't really have a machine at hand that can do anywhere close to this
> amount of I/O.

It's visible even when pulling from the page cache, if to a somewhat lesser
degree.

I wonder if it's worth adding a test function that computes checksums of all
shared buffers already in memory. That'd allow exercising the checksum code in
a realistic context (i.e. buffer locking etc. preventing some out-of-order
effects, operating on 8kB chunks, etc.) without also needing to involve the IO
path.
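
Very roughly something like this - a sketch only, written as if it lived in
bufmgr.c so it can use that file's static helpers (BufHdrGetBlock(),
PinBuffer_Locked(), UnpinBuffer()):

/*
 * Sketch: checksum every valid shared buffer in place, exercising
 * pg_checksum_page() with realistic buffer locking but no IO.
 */
Datum
pg_checksum_shared_buffers(PG_FUNCTION_ARGS)
{
	uint64		xorsum = 0;

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state;
		char	   *page;

		ResourceOwnerEnlarge(CurrentResourceOwner);

		buf_state = LockBufHdr(bufHdr);
		if (!(buf_state & BM_VALID))
		{
			UnlockBufHdr(bufHdr, buf_state);
			continue;
		}
		PinBuffer_Locked(bufHdr);	/* releases the header spinlock */

		/*
		 * pg_checksum_page() temporarily zeroes pd_checksum in place, so
		 * take the content lock exclusively, not shared.
		 */
		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
		page = (char *) BufHdrGetBlock(bufHdr);
		if (!PageIsNew((Page) page))
			xorsum ^= pg_checksum_page(page, bufHdr->tag.blockNum);
		LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
		UnpinBuffer(bufHdr);
	}

	PG_RETURN_INT64((int64) xorsum);
}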

> I'm asking because if it's the calculation that is slow then it seems
> like it's time to compile different ISA extension variants of the
> checksum code and select the best one at runtime.

You think it's ISA specific? I don't see a significant effect from compiling
with -march=native or not - and that should suffice to build the checksum code
with sufficiently high ISA support, right?
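
For concreteness, runtime selection along the lines you're suggesting could
look roughly like this - a sketch using GCC/clang function multi-versioning on
x86 + glibc, with a simplified stand-in for pg_checksum_block():

#include <stddef.h>
#include <stdint.h>

#define N_LANES 32
#define PRIME	16777619u

/* one FNV-1a-style mixing step per lane, as in CHECKSUM_COMP */
static inline uint32_t
mix(uint32_t sum, uint32_t value)
{
	uint32_t	tmp = sum ^ value;

	return tmp * PRIME ^ (tmp >> 17);
}

/*
 * The compiler emits one copy of the function per listed ISA; the dynamic
 * loader's ifunc resolver picks the best one for the running CPU at startup,
 * so no hand-written dispatch is needed.
 */
__attribute__((target_clones("default", "avx2", "avx512f")))
uint32_t
checksum_block(const uint32_t *data, size_t nwords)
{
	uint32_t	sums[N_LANES] = {0};
	uint32_t	result = 0;

	for (size_t i = 0; i < nwords / N_LANES; i++)
		for (size_t j = 0; j < N_LANES; j++)
			sums[j] = mix(sums[j], data[i * N_LANES + j]);

	for (size_t j = 0; j < N_LANES; j++)
		result ^= sums[j];

	return result;
}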

FWIW CPU profiles show all the time being spent in the "main checksum
calculation" loop:

Percent | Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
:
:
:
: 3 Disassembly of section .text:
:
: 5 00000000009e3670 <pg_checksum_page>:
: 6 * calculation isn't affected by the old checksum stored on the page.
: 7 * Restore it after, because actually updating the checksum is NOT part of
: 8 * the API of this function.
: 9 */
: 10 save_checksum = cpage->phdr.pd_checksum;
: 11 cpage->phdr.pd_checksum = 0;
0.00 : 9e3670: xor %eax,%eax
: 13 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e3672: mov $0x1000193,%r8d
: 15 cpage->phdr.pd_checksum = 0;
0.00 : 9e3678: vmovdqa -0x693fa0(%rip),%ymm3 # 34f6e0 <.LC0>
0.05 : 9e3680: vmovdqa -0x6935c8(%rip),%ymm4 # 3500c0 <.LC1>
0.00 : 9e3688: vmovdqa -0x693c10(%rip),%ymm0 # 34fa80 <.LC2>
0.00 : 9e3690: vmovdqa -0x693598(%rip),%ymm1 # 350100 <.LC3>
: 20 {
0.00 : 9e3698: mov %esi,%ecx
0.02 : 9e369a: lea 0x2000(%rdi),%rdx
: 23 save_checksum = cpage->phdr.pd_checksum;
0.00 : 9e36a1: movzwl 0x8(%rdi),%esi
: 25 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e36a5: vpbroadcastd %r8d,%ymm5
: 27 cpage->phdr.pd_checksum = 0;
0.00 : 9e36ab: mov %ax,0x8(%rdi)
: 29 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.14 : 9e36af: mov %rdi,%rax
0.00 : 9e36b2: nopw 0x0(%rax,%rax,1)
: 32 CHECKSUM_COMP(sums[j], page->data[i][j]);
15.36 : 9e36b8: vpxord (%rax),%ymm1,%ymm1
4.79 : 9e36be: vmovdqu 0x80(%rax),%ymm2
: 35 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.07 : 9e36c6: add $0x100,%rax
: 37 CHECKSUM_COMP(sums[j], page->data[i][j]);
2.45 : 9e36cc: vpxord -0xe0(%rax),%ymm0,%ymm0
2.85 : 9e36d3: vpmulld %ymm5,%ymm1,%ymm6
0.02 : 9e36d8: vpsrld $0x11,%ymm1,%ymm1
3.17 : 9e36dd: vpternlogd $0x96,%ymm6,%ymm1,%ymm2
2.01 : 9e36e4: vpmulld %ymm5,%ymm0,%ymm6
13.16 : 9e36e9: vpmulld %ymm5,%ymm2,%ymm1
0.03 : 9e36ee: vpsrld $0x11,%ymm2,%ymm2
0.02 : 9e36f3: vpsrld $0x11,%ymm0,%ymm0
2.57 : 9e36f8: vpxord %ymm2,%ymm1,%ymm1
0.89 : 9e36fe: vmovdqu -0x60(%rax),%ymm2
0.12 : 9e3703: vpternlogd $0x96,%ymm6,%ymm0,%ymm2
4.48 : 9e370a: vpmulld %ymm5,%ymm2,%ymm0
0.49 : 9e370f: vpsrld $0x11,%ymm2,%ymm2
0.99 : 9e3714: vpxord %ymm2,%ymm0,%ymm0
11.88 : 9e371a: vpxord -0xc0(%rax),%ymm4,%ymm2
2.80 : 9e3721: vpmulld %ymm5,%ymm2,%ymm6
0.68 : 9e3726: vpsrld $0x11,%ymm2,%ymm4
4.94 : 9e372b: vmovdqu -0x40(%rax),%ymm2
1.45 : 9e3730: vpternlogd $0x96,%ymm6,%ymm4,%ymm2
8.63 : 9e3737: vpmulld %ymm5,%ymm2,%ymm4
0.17 : 9e373c: vpsrld $0x11,%ymm2,%ymm2
1.81 : 9e3741: vpxord %ymm2,%ymm4,%ymm4
0.10 : 9e3747: vpxord -0xa0(%rax),%ymm3,%ymm2
0.70 : 9e374e: vpmulld %ymm5,%ymm2,%ymm6
1.65 : 9e3753: vpsrld $0x11,%ymm2,%ymm3
0.03 : 9e3758: vmovdqu -0x20(%rax),%ymm2
0.85 : 9e375d: vpternlogd $0x96,%ymm6,%ymm3,%ymm2
3.73 : 9e3764: vpmulld %ymm5,%ymm2,%ymm3
0.07 : 9e3769: vpsrld $0x11,%ymm2,%ymm2
1.48 : 9e376e: vpxord %ymm2,%ymm3,%ymm3
: 68 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.02 : 9e3774: cmp %rax,%rdx
2.32 : 9e3777: jne 9e36b8 <pg_checksum_page+0x48>
: 71 CHECKSUM_COMP(sums[j], 0);
0.17 : 9e377d: vpmulld %ymm5,%ymm0,%ymm7
0.07 : 9e3782: vpmulld %ymm5,%ymm1,%ymm6
: 74 checksum = pg_checksum_block(cpage);
: 75 cpage->phdr.pd_checksum = save_checksum;
0.00 : 9e3787: mov %si,0x8(%rdi)
: 77 CHECKSUM_COMP(sums[j], 0);
0.02 : 9e378b: vpsrld $0x11,%ymm0,%ymm0
0.02 : 9e3790: vpsrld $0x11,%ymm1,%ymm1
0.02 : 9e3795: vpsrld $0x11,%ymm4,%ymm2
0.00 : 9e379a: vpxord %ymm0,%ymm7,%ymm7
0.10 : 9e37a0: vpmulld %ymm5,%ymm4,%ymm0
0.00 : 9e37a5: vpxord %ymm1,%ymm6,%ymm6
0.17 : 9e37ab: vpmulld %ymm5,%ymm3,%ymm1
0.19 : 9e37b0: vpmulld %ymm5,%ymm6,%ymm4
0.00 : 9e37b5: vpsrld $0x11,%ymm6,%ymm6
0.02 : 9e37ba: vpxord %ymm2,%ymm0,%ymm0
0.00 : 9e37c0: vpsrld $0x11,%ymm3,%ymm2
0.22 : 9e37c5: vpmulld %ymm5,%ymm7,%ymm3
0.02 : 9e37ca: vpsrld $0x11,%ymm7,%ymm7
0.00 : 9e37cf: vpxord %ymm2,%ymm1,%ymm1
0.03 : 9e37d5: vpsrld $0x11,%ymm0,%ymm2
0.15 : 9e37da: vpmulld %ymm5,%ymm0,%ymm0
: 94 result ^= sums[i];
0.00 : 9e37df: vpternlogd $0x96,%ymm3,%ymm7,%ymm2
: 96 CHECKSUM_COMP(sums[j], 0);
0.05 : 9e37e6: vpsrld $0x11,%ymm1,%ymm3
0.19 : 9e37eb: vpmulld %ymm5,%ymm1,%ymm1
: 99 result ^= sums[i];
0.02 : 9e37f0: vpternlogd $0x96,%ymm4,%ymm6,%ymm0
0.10 : 9e37f7: vpxord %ymm1,%ymm0,%ymm0
0.07 : 9e37fd: vpternlogd $0x96,%ymm2,%ymm3,%ymm0
0.15 : 9e3804: vextracti32x4 $0x1,%ymm0,%xmm1
0.03 : 9e380b: vpxord %xmm0,%xmm1,%xmm0
0.14 : 9e3811: vpsrldq $0x8,%xmm0,%xmm1
0.12 : 9e3816: vpxord %xmm1,%xmm0,%xmm0
0.09 : 9e381c: vpsrldq $0x4,%xmm0,%xmm1
0.12 : 9e3821: vpxord %xmm1,%xmm0,%xmm0
0.05 : 9e3827: vmovd %xmm0,%eax
:
: 111 /* Mix in the block number to detect transposed pages */
: 112 checksum ^= blkno;
0.07 : 9e382b: xor %ecx,%eax
:
: 115 /*
: 116 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
: 117 * one. That avoids checksums of zero, which seems like a good idea.
: 118 */
: 119 return (uint16) ((checksum % 65535) + 1);
0.00 : 9e382d: mov $0x80008001,%ecx
0.03 : 9e3832: mov %eax,%edx
0.27 : 9e3834: imul %rcx,%rdx
0.09 : 9e3838: shr $0x2f,%rdx
0.07 : 9e383c: lea 0x1(%rax,%rdx,1),%eax
0.00 : 9e3840: vzeroupper
: 126 }
0.15 : 9e3843: ret

I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.
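
For reference, the source of that loop, quoted (lightly trimmed) from
src/include/storage/checksum_impl.h - the N_SUMS independent lanes exist
precisely so the compiler can vectorize it, which is the AVX-512 code above:

/* prime multiplier of FNV-1a hash */
#define FNV_PRIME 16777619

/* number of partial checksums computed in parallel */
#define N_SUMS 32

#define CHECKSUM_COMP(checksum, value) \
do { \
	uint32 __tmp = (checksum) ^ (value); \
	(checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

	/* main checksum calculation */
	for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], page->data[i][j]);

	/* finally add in two rounds of zeroes for additional mixing */
	for (i = 0; i < 2; i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], 0);

	/* xor fold partial checksums together */
	for (i = 0; i < N_SUMS; i++)
		result ^= sums[i];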

Greetings,

Andres Freund
