Re: BAS_BULKREAD vs read stream

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: BAS_BULKREAD vs read stream
Date: 2025-04-07 20:28:20
Message-ID: dr4rjc4xewy5uf2dzywuq2fh6fnaydiwxexumjx3b6hkefatcn@kibyxaztit2i
Lists: pgsql-hackers

Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > I think we should consider increasing BAS_BULKREAD to something like
> > Min(256, io_combine_limit * (effective_io_concurrency + 1))
>
> Do you mean Max? If so, this basically makes sense to me.

Err, yes.

I was wondering whether we should add a Max(SYNC_SCAN_REPORT_INTERVAL, ...),
but it's a private value, and the proposed formula doesn't really change
anything for SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.
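
To make that concrete, here is a tiny standalone sketch of the Max() variant of
the formula. The unit handling (a 256kB floor matching the current ring size,
io_combine_limit counted in BLCKSZ-sized blocks) and the example GUC values are
my illustrative assumptions, not the actual freelist.c code:

/*
 * Sketch of Max(256kB, io_combine_limit * (effective_io_concurrency + 1))
 * for sizing the BAS_BULKREAD ring.  GUC values and unit handling are
 * illustrative assumptions only.
 */
#include <stdio.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */
#define Max(a, b) ((a) > (b) ? (a) : (b))

int
main(void)
{
    int     io_combine_limit = 16;      /* in blocks, i.e. 128kB */
    int     effective_io_concurrency = 16;
    int     ring_size_kb;

    /* never go below the current 256kB BAS_BULKREAD ring size */
    ring_size_kb = Max(256,
                       io_combine_limit * (BLCKSZ / 1024) *
                       (effective_io_concurrency + 1));

    printf("BAS_BULKREAD ring: %d kB (%d buffers)\n",
           ring_size_kb, ring_size_kb / (BLCKSZ / 1024));
    return 0;
}

With those illustrative defaults that comes out to 2176kB (272 buffers), i.e.
well above the current 256kB but far below the 8-32MB range benchmarked below.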

> Overall, I think even though the ring is about reusing buffers, we
> have to think about how many IOs that reasonably is -- which this
> formula does.

Right - the prior limit kinda made sense before we had IO combining, but with
IO combining *and* AIO it is clearly obsolete.

> You mentioned testing with 8MB, did you see some sort of cliff anywhere
> between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64 and 128kB, then gets worse
between 128kB and 256kB (the current value), and then seems to gradually get
worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB and then is very
good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I
do see a 6% decline from 16MB->32MB.

I suspect the 16-32MB cliff is due to L3-related effects - the L3 is 13.8MB
per socket (of which I have 2). It's not entirely clear what that effect is -
all the additional cycles are spent in the kernel, not in userspace. I
strongly suspect it's related to SMAP [1], but I don't really understand the
details. All I know is that disabling SMAP removes this cliff on several Intel
and AMD systems, both client and server CPUs.

For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x
faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same
up to 64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB,
3.2x faster at 4MB and stays the same up to 64MB.

For DIO/unbuffered IO:

Since io_method=sync obviously doesn't do DIO/unbuffered IO in a reasonable
way, it doesn't make sense to compare against it. So I'm comparing to buffered
IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the
throughput of buffered IO) and slightly worse than buffered at 256kB. Best perf
is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x
the throughput of buffered IO) and starts to be decent after that. Best perf is
reached at 4MB and stays very consistent after that.

The peak perf of buffered-but-uncached IO and of DIO ends up rather close, as
I'm testing this on a PCIe3 drive.

The difference in CPU cycles is massive though:

worker buffered:

9,850.27 msec cpu-clock # 3.001 CPUs utilized
305,050 context-switches # 30.969 K/sec
51,049 cpu-migrations # 5.182 K/sec
11,530 page-faults # 1.171 K/sec
16,615,532,455 instructions # 0.84 insn per cycle (30.72%)
19,876,584,840 cycles # 2.018 GHz (30.75%)
3,256,065,951 branches # 330.556 M/sec (30.78%)
26,046,144 branch-misses # 0.80% of all branches (30.81%)
4,452,808,846 L1-dcache-loads # 452.050 M/sec (30.83%)
574,304,216 L1-dcache-load-misses # 12.90% of all L1-dcache accesses (30.82%)
169,117,254 LLC-loads # 17.169 M/sec (30.82%)
82,769,152 LLC-load-misses # 48.94% of all LL-cache accesses (30.82%)
377,137,247 L1-icache-load-misses (30.78%)
4,475,873,620 dTLB-loads # 454.391 M/sec (30.76%)
5,496,266 dTLB-load-misses # 0.12% of all dTLB cache accesses (30.73%)
9,765,507 iTLB-loads # 991.395 K/sec (30.70%)
7,525,173 iTLB-load-misses # 77.06% of all iTLB cache accesses (30.70%)

3.282465335 seconds time elapsed

worker DIO:
9,783.05 msec cpu-clock # 3.000 CPUs utilized
356,102 context-switches # 36.400 K/sec
32,575 cpu-migrations # 3.330 K/sec
1,245 page-faults # 127.261 /sec
8,076,414,780 instructions # 1.00 insn per cycle (30.73%)
8,109,508,194 cycles # 0.829 GHz (30.73%)
1,585,426,781 branches # 162.058 M/sec (30.74%)
17,869,296 branch-misses # 1.13% of all branches (30.78%)
2,199,974,033 L1-dcache-loads # 224.876 M/sec (30.79%)
167,855,899 L1-dcache-load-misses # 7.63% of all L1-dcache accesses (30.79%)
31,303,238 LLC-loads # 3.200 M/sec (30.79%)
2,126,825 LLC-load-misses # 6.79% of all LL-cache accesses (30.79%)
322,505,615 L1-icache-load-misses (30.79%)
2,186,161,593 dTLB-loads # 223.464 M/sec (30.79%)
3,892,051 dTLB-load-misses # 0.18% of all dTLB cache accesses (30.79%)
10,306,643 iTLB-loads # 1.054 M/sec (30.77%)
6,279,217 iTLB-load-misses # 60.92% of all iTLB cache accesses (30.74%)

3.260901966 seconds time elapsed

io_uring buffered:

9,924.48 msec cpu-clock # 2.990 CPUs utilized
340,821 context-switches # 34.341 K/sec
57,048 cpu-migrations # 5.748 K/sec
1,336 page-faults # 134.617 /sec
16,630,629,989 instructions # 0.88 insn per cycle (30.74%)
18,985,579,559 cycles # 1.913 GHz (30.64%)
3,253,081,357 branches # 327.784 M/sec (30.67%)
24,599,858 branch-misses # 0.76% of all branches (30.68%)
4,515,979,721 L1-dcache-loads # 455.035 M/sec (30.69%)
556,041,180 L1-dcache-load-misses # 12.31% of all L1-dcache accesses (30.67%)
160,198,962 LLC-loads # 16.142 M/sec (30.65%)
75,164,349 LLC-load-misses # 46.92% of all LL-cache accesses (30.65%)
348,585,830 L1-icache-load-misses (30.63%)
4,473,414,356 dTLB-loads # 450.746 M/sec (30.91%)
1,193,495 dTLB-load-misses # 0.03% of all dTLB cache accesses (31.04%)
5,507,512 iTLB-loads # 554.942 K/sec (31.02%)
2,973,177 iTLB-load-misses # 53.98% of all iTLB cache accesses (31.02%)

3.319117422 seconds time elapsed

io_uring DIO:

9,782.99 msec cpu-clock # 3.000 CPUs utilized
96,916 context-switches # 9.907 K/sec
8 cpu-migrations # 0.818 /sec
1,001 page-faults # 102.320 /sec
5,902,978,172 instructions # 1.45 insn per cycle (30.73%)
4,059,940,112 cycles # 0.415 GHz (30.73%)
1,117,690,786 branches # 114.248 M/sec (30.75%)
10,994,087 branch-misses # 0.98% of all branches (30.77%)
1,559,149,686 L1-dcache-loads # 159.374 M/sec (30.78%)
85,057,280 L1-dcache-load-misses # 5.46% of all L1-dcache accesses (30.78%)
11,393,236 LLC-loads # 1.165 M/sec (30.78%)
2,599,701 LLC-load-misses # 22.82% of all LL-cache accesses (30.79%)
174,124,990 L1-icache-load-misses (30.80%)
1,545,148,685 dTLB-loads # 157.942 M/sec (30.79%)
156,524 dTLB-load-misses # 0.01% of all dTLB cache accesses (30.79%)
3,325,307 iTLB-loads # 339.907 K/sec (30.77%)
2,288,730 iTLB-load-misses # 68.83% of all iTLB cache accesses (30.74%)

3.260716339 seconds time elapsed

I'd say a ~4.5x reduction in cycles (io_uring buffered vs. DIO) is rather nice :)

> > I experimented some whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> > couldn't come up with any benefits. It seems to hurt fairly quickly.
>
> So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant with "hurt
fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make
synchronize_seqscans work even less well.
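
For context on why a larger report interval hurts: other backends only see the
scan position as often as it is published. A simplified standalone illustration
of that throttling follows (the 128kB-per-report value and the report function
are my assumptions here, not the actual syncscan.c code):

/*
 * Illustration of report-interval throttling for synchronized seqscans:
 * the scan position is only published every SYNC_SCAN_REPORT_INTERVAL
 * blocks, so a larger interval means backends joining the scan start
 * from a staler position.  Values are illustrative assumptions.
 */
#include <stdio.h>

#define BLCKSZ 8192
#define SYNC_SCAN_REPORT_INTERVAL (128 * 1024 / BLCKSZ) /* 16 blocks */

static void
report_location(unsigned int block)
{
    /* stand-in for publishing the position in shared memory */
    printf("reporting scan position: block %u\n", block);
}

int
main(void)
{
    for (unsigned int block = 0; block < 64; block++)
    {
        /* only publish every SYNC_SCAN_REPORT_INTERVAL-th block */
        if (block % SYNC_SCAN_REPORT_INTERVAL == 0)
            report_location(block);
    }
    return 0;
}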

Greetings,

Andres Freund

[1] https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
