From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Confine vacuum skip logic to lazy_scan_skip
Date: 2024-03-17 06:53:10
Message-ID: CA+hUKGLY4Q4ZY4f1rvnFtv6+PkjNf8MejdPkcju3Qii9DYqqcQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> I've rebased the attached v10 over top of the changes to
> lazy_scan_heap() Heikki just committed and over the v6 streaming read
> patch set. I started testing them and see that you are right, we no
> longer pin too many buffers. However, the uncached example below is
> now slower with streaming read than on master -- it looks to be
> because it is doing twice as many WAL writes and syncs. I'm still
> investigating why that is.
That makes sense to me. We have 256kB (32 buffers) in our ring, but now
we're trying to read ahead 128kB (16 buffers) at a time, so the
read-ahead window occupies half the ring: we can only accumulate the WAL
generated while dirtying half as many blocks before we have to write
them out, so we flush twice as often.
If I change the ring size to 384kB, allowing for that read-ahead
window, I see approximately the same WAL flushes. Surely we'd never
be able to get the behaviour to match *and* keep the same ring size?
We simply need those 16 extra buffers to have a chance of accumulating
32 dirty buffers, and the associated WAL. Do you see the same result,
or do you think something more than that is wrong here?
Here are some system call traces using your test that helped me see
the behaviour:
1. Unpatched, i.e. no streaming read: we flush 90kB of WAL generated by
32 pages before we write them out one at a time, just before we read in
their replacements. One flush covers the LSNs of all the pages that
will be written, even though it's only called for the first page to be
written. That's because XLogFlush(lsn), if it decides to do anything,
flushes as far as it can... IOW, when we hit the *oldest* dirty block,
that's when we write out the WAL up to where we dirtied the *newest*
block, which covers the 32 pwrite() calls here:
pwrite(30,...,90112,0xf90000) = 90112 (0x16000)
fdatasync(30) = 0 (0x0)
pwrite(27,...,8192,0x0) = 8192 (0x2000)
pread(27,...,8192,0x40000) = 8192 (0x2000)
pwrite(27,...,8192,0x2000) = 8192 (0x2000)
pread(27,...,8192,0x42000) = 8192 (0x2000)
pwrite(27,...,8192,0x4000) = 8192 (0x2000)
pread(27,...,8192,0x44000) = 8192 (0x2000)
pwrite(27,...,8192,0x6000) = 8192 (0x2000)
pread(27,...,8192,0x46000) = 8192 (0x2000)
pwrite(27,...,8192,0x8000) = 8192 (0x2000)
pread(27,...,8192,0x48000) = 8192 (0x2000)
pwrite(27,...,8192,0xa000) = 8192 (0x2000)
pread(27,...,8192,0x4a000) = 8192 (0x2000)
pwrite(27,...,8192,0xc000) = 8192 (0x2000)
pread(27,...,8192,0x4c000) = 8192 (0x2000)
pwrite(27,...,8192,0xe000) = 8192 (0x2000)
pread(27,...,8192,0x4e000) = 8192 (0x2000)
pwrite(27,...,8192,0x10000) = 8192 (0x2000)
pread(27,...,8192,0x50000) = 8192 (0x2000)
pwrite(27,...,8192,0x12000) = 8192 (0x2000)
pread(27,...,8192,0x52000) = 8192 (0x2000)
pwrite(27,...,8192,0x14000) = 8192 (0x2000)
pread(27,...,8192,0x54000) = 8192 (0x2000)
pwrite(27,...,8192,0x16000) = 8192 (0x2000)
pread(27,...,8192,0x56000) = 8192 (0x2000)
pwrite(27,...,8192,0x18000) = 8192 (0x2000)
pread(27,...,8192,0x58000) = 8192 (0x2000)
pwrite(27,...,8192,0x1a000) = 8192 (0x2000)
pread(27,...,8192,0x5a000) = 8192 (0x2000)
pwrite(27,...,8192,0x1c000) = 8192 (0x2000)
pread(27,...,8192,0x5c000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e000) = 8192 (0x2000)
pread(27,...,8192,0x5e000) = 8192 (0x2000)
pwrite(27,...,8192,0x20000) = 8192 (0x2000)
pread(27,...,8192,0x60000) = 8192 (0x2000)
pwrite(27,...,8192,0x22000) = 8192 (0x2000)
pread(27,...,8192,0x62000) = 8192 (0x2000)
pwrite(27,...,8192,0x24000) = 8192 (0x2000)
pread(27,...,8192,0x64000) = 8192 (0x2000)
pwrite(27,...,8192,0x26000) = 8192 (0x2000)
pread(27,...,8192,0x66000) = 8192 (0x2000)
pwrite(27,...,8192,0x28000) = 8192 (0x2000)
pread(27,...,8192,0x68000) = 8192 (0x2000)
pwrite(27,...,8192,0x2a000) = 8192 (0x2000)
pread(27,...,8192,0x6a000) = 8192 (0x2000)
pwrite(27,...,8192,0x2c000) = 8192 (0x2000)
pread(27,...,8192,0x6c000) = 8192 (0x2000)
pwrite(27,...,8192,0x2e000) = 8192 (0x2000)
pread(27,...,8192,0x6e000) = 8192 (0x2000)
pwrite(27,...,8192,0x30000) = 8192 (0x2000)
pread(27,...,8192,0x70000) = 8192 (0x2000)
pwrite(27,...,8192,0x32000) = 8192 (0x2000)
pread(27,...,8192,0x72000) = 8192 (0x2000)
pwrite(27,...,8192,0x34000) = 8192 (0x2000)
pread(27,...,8192,0x74000) = 8192 (0x2000)
pwrite(27,...,8192,0x36000) = 8192 (0x2000)
pread(27,...,8192,0x76000) = 8192 (0x2000)
pwrite(27,...,8192,0x38000) = 8192 (0x2000)
pread(27,...,8192,0x78000) = 8192 (0x2000)
pwrite(27,...,8192,0x3a000) = 8192 (0x2000)
pread(27,...,8192,0x7a000) = 8192 (0x2000)
pwrite(27,...,8192,0x3c000) = 8192 (0x2000)
pread(27,...,8192,0x7c000) = 8192 (0x2000)
pwrite(27,...,8192,0x3e000) = 8192 (0x2000)
pread(27,...,8192,0x7e000) = 8192 (0x2000)
(Digression: this alternating tail-write/head-read pattern defeats the
read-ahead and write-behind heuristics on a bunch of OSes, though not
Linux, which only seems to worry about the reads; other Unixes have
write-behind detection too, and I believe at least some are confused
by this pattern of tiny writes following along some distance behind
tiny reads. Andrew Gierth figured that out after noticing poor ring
buffer performance, and we eventually got it fixed for one such
system[1] by separating the sequence detection for reads and writes.)
2. With your patches, we replace all those little pread calls with
nice wide calls, yay!, but now we only manage to write out about half
the amount of WAL at a time as you discovered. The repeating blocks
of system calls now look like this, but there are twice as many of
them:
pwrite(32,...,40960,0x224000) = 40960 (0xa000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,8192,0x5c000) = 8192 (0x2000)
preadv(27,[...],3,0x7e000) = 131072 (0x20000)
pwrite(27,...,8192,0x5e000) = 8192 (0x2000)
pwrite(27,...,8192,0x60000) = 8192 (0x2000)
pwrite(27,...,8192,0x62000) = 8192 (0x2000)
pwrite(27,...,8192,0x64000) = 8192 (0x2000)
pwrite(27,...,8192,0x66000) = 8192 (0x2000)
pwrite(27,...,8192,0x68000) = 8192 (0x2000)
pwrite(27,...,8192,0x6a000) = 8192 (0x2000)
pwrite(27,...,8192,0x6c000) = 8192 (0x2000)
pwrite(27,...,8192,0x6e000) = 8192 (0x2000)
pwrite(27,...,8192,0x70000) = 8192 (0x2000)
pwrite(27,...,8192,0x72000) = 8192 (0x2000)
pwrite(27,...,8192,0x74000) = 8192 (0x2000)
pwrite(27,...,8192,0x76000) = 8192 (0x2000)
pwrite(27,...,8192,0x78000) = 8192 (0x2000)
pwrite(27,...,8192,0x7a000) = 8192 (0x2000)
3. With your patches and test but this time using VACUUM
(BUFFER_USAGE_LIMIT = '384kB'), the repeating block grows bigger and
we get the larger WAL flushes back again, because now we're able to
collect 32 blocks' worth of WAL up front again:
pwrite(32,...,90112,0x50c000) = 90112 (0x16000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,8192,0x1dc000) = 8192 (0x2000)
pread(27,...,131072,0x21e000) = 131072 (0x20000)
pwrite(27,...,8192,0x1de000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e0000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e2000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e4000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e6000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e8000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ea000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ec000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ee000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f0000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f2000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f4000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f6000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f8000) = 8192 (0x2000)
pwrite(27,...,8192,0x1fa000) = 8192 (0x2000)
pwrite(27,...,8192,0x1fc000) = 8192 (0x2000)
preadv(27,[...],3,0x23e000) = 131072 (0x20000)
pwrite(27,...,8192,0x1fe000) = 8192 (0x2000)
pwrite(27,...,8192,0x200000) = 8192 (0x2000)
pwrite(27,...,8192,0x202000) = 8192 (0x2000)
pwrite(27,...,8192,0x204000) = 8192 (0x2000)
pwrite(27,...,8192,0x206000) = 8192 (0x2000)
pwrite(27,...,8192,0x208000) = 8192 (0x2000)
pwrite(27,...,8192,0x20a000) = 8192 (0x2000)
pwrite(27,...,8192,0x20c000) = 8192 (0x2000)
pwrite(27,...,8192,0x20e000) = 8192 (0x2000)
pwrite(27,...,8192,0x210000) = 8192 (0x2000)
pwrite(27,...,8192,0x212000) = 8192 (0x2000)
pwrite(27,...,8192,0x214000) = 8192 (0x2000)
pwrite(27,...,8192,0x216000) = 8192 (0x2000)
pwrite(27,...,8192,0x218000) = 8192 (0x2000)
pwrite(27,...,8192,0x21a000) = 8192 (0x2000)
4. For learning/exploration only, I rebased my experimental vectored
FlushBuffers() patch, which teaches the checkpointer to write relation
data out using smgrwritev(). The checkpointer explicitly sorts
blocks, but I think ring buffers should naturally often contain
consecutive blocks in ring order. Highly experimental POC code pushed
to a public branch[2], but I am not proposing anything here, just
trying to understand things. The nicest looking system call trace was
with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads
and WAL writes 128kB at a time:
pwrite(32,...,131072,0xfc6000) = 131072 (0x20000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,131072,0x6c0000) = 131072 (0x20000)
pread(27,...,131072,0x73e000) = 131072 (0x20000)
pwrite(27,...,131072,0x6e0000) = 131072 (0x20000)
pread(27,...,131072,0x75e000) = 131072 (0x20000)
pwritev(27,[...],3,0x77e000) = 131072 (0x20000)
preadv(27,[...],3,0x77e000) = 131072 (0x20000)
That was a fun experiment, but... I recognise that efficient cleaning
of ring buffers is a Hard Problem requiring more concurrency: it's
just too late to be flushing that WAL. But we also don't want to
start writing back data immediately after dirtying pages (cf. OS
write-behind for big sequential writes in traditional Unixes), because
we're not allowed to write data out without writing the WAL first and
we currently need to build up bigger WAL writes to do so efficiently
(cf. some other systems that can write out fragments of WAL
concurrently so the latency-vs-throughput trade-off doesn't have to be
so extreme). So we want to defer writing it, but not too long. We
need something cleaning our buffers (or at least flushing the
associated WAL, but preferably also writing the data) not too late and
not too early, and more in sync with our scan than the WAL writer is.
What that machinery should look like I don't know (but I believe
Andres has ideas).
[1] https://github.com/freebsd/freebsd-src/commit/f2706588730a5d3b9a687ba8d4269e386650cc4f
[2] https://github.com/macdice/postgres/tree/vectored-ring-buffer