From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Allow io_combine_limit up to 1MB
Date: 2025-02-12 00:03:27
Message-ID: yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq@3r54w5zqldwn
Lists: pgsql-hackers
Hi,
On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:
> Tomas queried[1] the limit of 256kB (or really 32 blocks) for
> io_combine_limit. Yeah, I think we should increase it and allow
> experimentation with larger numbers. Note that real hardware and
> protocols have segment and size limits that can force the kernel to
> split your I/Os, so it's not at all a given that it'll help much or at
> all to use very large sizes, but YMMV.
FWIW, I see substantial performance *regressions* with *big* IO sizes using
fio. Just looking at cached buffered IO.
for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192; do
    echo -ne "$s\t\t"
    numactl --physcpubind 3 \
        fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 \
            --time_based=0 --runtime=10 --name test --rw read --buffered 0 \
            --ioengine psync --buffered 1 --invalidate 0 \
            --output-format json --bs=$((1024*${s})) \
        | jq '.jobs[] | .read.bw_mean'
done
io size (kB)    throughput (MB/s)
4               6752
8               9297
16              11082
32              14392
64              15967
128             16658
256             16864
512             19114
1024            12874
2048            11770
4096            11781
8192            11744
I.e. throughput peaks at 19GB/s and drops off fairly noticeably after that.
I've measured this on a number of different AMD and Intel systems, with
similar results, albeit with different inflection points. On the Intel systems
I have access to, the point where things slow down typically seems to be
earlier than on AMD.
It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
*vastly* better performance:
io size (kB)    throughput (MB/s)
4               12054
8               13872
16              16709
32              20564
64              22559
128             23133
256             23317
512             25829
1024            15912
2048            15213
4096            14129
8192            13795
Most of the gain isn't due to mitigations=off but to clearcpuid=smap.
Apparently SMAP, which requires the kernel to execute explicit instructions
before it is allowed to access userspace memory (to make exploitation harder),
reacts badly to copying lots of memory.
This seems absolutely bonkers to me.
> I was originally cautious because I didn't want to make a few stack buffers
> too big, but arrays of BlockNumber, struct iovec, and pointer don't seem too
> excessive at say 128 (cf whole blocks on the stack, a thing we do, which
> would still be many times larger than the relevant arrays). I was also
> anticipating future code that would need to multiply that number by other
> terms to allocate shared memory, but after some off-list discussion, that
> seems OK: such code should be able to deal with that using GUCs instead of
> maximally pessimal allocation. 128 gives a nice round number of 1M as a
> maximum transfer size, and comparable systems seem to have upper limits
> around that mark. Patch attached.
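As an aside, to put rough numbers on the stack-size point, here is a
standalone sketch (not PostgreSQL code; it only assumes BlockNumber is 32
bits wide and a 64-bit struct iovec):

#include <stdio.h>
#include <stdint.h>
#include <sys/uio.h>

typedef uint32_t BlockNumber;	/* PostgreSQL's BlockNumber is 32 bits wide */

int
main(void)
{
	const size_t n = 128;		/* proposed maximum io_combine_limit */

	/* per-I/O array sizes at 128 entries, on a typical 64-bit system */
	printf("BlockNumber array:  %zu bytes\n", n * sizeof(BlockNumber));
	printf("struct iovec array: %zu bytes\n", n * sizeof(struct iovec));
	printf("pointer array:      %zu bytes\n", n * sizeof(void *));
	printf("one 8kB block:      %d bytes\n", 8192);
	return 0;
}

That's 512 + 2048 + 1024 bytes for the three arrays, i.e. less than half of a
single 8kB block, so 128 indeed doesn't seem excessive on that front.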
To make it possible for such future code to size shared memory from a GUC
rather than from a maximally pessimal allocation, we'd need two different
io_combine_limit GUCs: one PGC_POSTMASTER GUC that defines a hard maximum, and
one that can be changed at runtime, up to the PGC_POSTMASTER one.
It's somewhat painful to have such GUCs, because we don't have real
infrastructure for interdependent GUCs. Typically the easiest way is to just
do a Min() at runtime between the two GUCs. But looking at the number of
references to io_combine_limit in read_stream.c, that doesn't look like fun.
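The least-bad variant I can come up with is to do that Min() exactly once,
when a stream is created, and have the rest of the file only look at the
stored value. Very rough standalone sketch of the pattern (the names
io_max_combine_limit, effective_io_combine_limit and ReadStreamSketch are
made up; this is not the actual read_stream.c code):

#include <stdio.h>

#define Min(x, y)	((x) < (y) ? (x) : (y))

/* stand-ins for the two GUCs */
static int	io_max_combine_limit = 128;	/* PGC_POSTMASTER hard cap, in blocks */
static int	io_combine_limit = 32;		/* changeable at runtime, in blocks */

typedef struct ReadStreamSketch
{
	/* clamped once at stream creation; everything else only reads this */
	int			effective_io_combine_limit;
} ReadStreamSketch;

static void
read_stream_sketch_init(ReadStreamSketch *stream)
{
	stream->effective_io_combine_limit =
		Min(io_combine_limit, io_max_combine_limit);
}

int
main(void)
{
	ReadStreamSketch stream;

	read_stream_sketch_init(&stream);
	printf("effective io_combine_limit: %d blocks\n",
		   stream.effective_io_combine_limit);
	return 0;
}

That would at least keep the Min() out of the per-buffer paths, at the cost of
a stream keeping the limit it was created with.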
Do you have a good idea how to keep read_stream.c readable?
Greetings,
Andres Freund