Re: Allow io_combine_limit up to 1MB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Allow io_combine_limit up to 1MB
Date: 2025-02-14 17:06:33
Message-ID: wgxyeb5yuyi25itl2oufnvqi3pl763vvhsysrqq6de7vhjyl46@o32rtkfovwsn
Lists: pgsql-hackers

Hi,

On 2025-02-14 09:32:32 +0100, Jakub Wartak wrote:
> On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > FWIW, I see substantial performance *regressions* with *big* IO sizes using
> > fio. Just looking at cached buffered IO.
> >
> > for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done
> >
> > io size kB throughput in MB/s
> [..]
> > 256 16864
> > 512 19114
> > 1024 12874
> [..]
>
> > It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
> > *vastly* better performance:
> >
> > io size kB throughput in MB/s
> [..]
> > 128 23133
> > 256 23317
> > 512 25829
> > 1024 15912
> [..]
> > Most of the gain isn't due to mitigations=off but clearcpuid=smap. Apparently
> > SMAP, which requires explicit code to allow kernel space to access userspace
> > memory (to make exploitation harder), reacts badly to copying lots of memory.
> >
> > This seems absolutely bonkers to me.
>
> There are two bizarre things there: a +35% perf boost just like that due
> to security drama, and io_size=512kB being so special that it gives a
> 10-13% boost in your case. Any ideas why?

I think there are a few overlapping "cost factors", and 512kB just turns out
to be the global minimum (a rough way to see the same curve without fio is
sketched below):
- syscall overhead: the fewer syscalls the better
- memory copy cost: higher for small-ish amounts, then lower
- SMAP costs: seem to increase with larger amounts of memory
- CPU cache: copying less than the L3 cache size is faster, as otherwise
  memory bandwidth starts to play a role

> I've run it on that Lsv2
> individual MS NVMe under Hyper-V, on ext4, which seems to be a much more
> real-world, average-Joe situation; it is much slower, and it shows no
> advantage for block sizes beyond, let's say, 128:
>
> io size kB throughput in MB/s
> 4 1070
> 8 1117
> 16 1231
> 32 1264
> 64 1249
> 128 1313
> 256 1323
> 512 1257
> 1024 1216
> 2048 1271
> 4096 1304
> 8192 1214
>
> The top hitters are of course things like clear_page_rep [k] and
> rep_movs_alternative [k] (that was with mitigations=on).

I think you're measuring something different than I was. I was purposefully
measuring a fully-cached workload, which worked with that recipe because I
have more than 32GB of RAM available. But I assume you're running this in a VM
that doesn't have that much, and thus you're actually benchmarking reading data
from disk and - probably more influential in this case - finding buffers to
put the newly read data into.
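
If you want to double-check which regime you're in, something simple like
this (just a sketch) is usually enough:

  grep -E 'MemTotal|MemAvailable|^Cached' /proc/meminfo  # does the data set fit in RAM?
  iostat -xm 1                                           # watch the nvme device while fio runs

If the device shows significant read traffic during the run, you're
benchmarking the storage path and buffer replacement rather than the memory
copies.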

Greetings,

Andres Freund
