From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Streaming read-ready sequential scan code |
Date: | 2024-04-05 04:14:53 |
Message-ID: | CA+hUKGKXZALJ=6aArUsXRJzBm=qvc4AWp7=iJNXJQqpbRLnD_w@mail.gmail.com |
Lists: | pgsql-hackers |
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does. Oops.
With the attached minor rearrangement, it works fine. I also poked
some more at that memory prefetcher. Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
GCC -O3).
create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;
select count(*) from t;
cold = must be read from actual disk (after Linux drop_caches)
warm = read from the Linux page cache
hot  = already in PostgreSQL's buffer cache via pg_prewarm
                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms  <-- regression
seqscan + fastpath                2493ms  711ms  200ms  <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms  <-- \O/
Cold shows no difference. That's just my disk demonstrating Linux
read-ahead at 128kB (the default); random I/O is obviously a more
interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX, i.e. 4096 512-byte sectors), and I
believe this effect probably also increases on fancier, faster storage
than what I have on hand:
                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms
Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).
The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can
you reproduce that?
The memory prefetching trick, on top of that, seems to be a good
optimisation so far. Note that it's not an entirely independent trick:
it's something we can only do now that we can see into the future.
It's the next level up of prefetching, worth doing around 60ns before
you need the data, I guess. Who knows how thrashed the cache might be
before the caller gets around to accessing that page, but there
doesn't seem to be much of a cost or downside to this bet. We know
there are many more opportunities like that[1], but I don't want to
second-guess the AM here; I'm just betting that the caller is going to
look at the header.
Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI. Looking into that, going to find my Mac...
Attachment | Content-Type | Size |
---|---|---|
v10-0001-Use-streaming-I-O-in-heapam-sequential-scan.patch | text/x-patch | 7.0 KB |
v10-0002-Improve-read_stream.c-s-fast-path.patch | text/x-patch | 4.8 KB |
v10-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch | text/x-patch | 4.7 KB |
v10-0004-Prefetch-page-header-memory-when-streaming-relat.patch | text/x-patch | 1.7 KB |