Re: Confine vacuum skip logic to lazy_scan_skip

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M. Borodin" <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Confine vacuum skip logic to lazy_scan_skip
Date: 2024-06-28 21:36:25
Message-ID: CAAKRu_bbkmwAzSBgnezancgJeXrQZXy4G4kBTd+5=cr86H5yew@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > I've rebased the attached v10 over top of the changes to
> > lazy_scan_heap() Heikki just committed and over the v6 streaming read
> > patch set. I started testing them and see that you are right, we no
> > longer pin too many buffers. However, the uncached example below is
> > now slower with streaming read than on master -- it looks to be
> > because it is doing twice as many WAL writes and syncs. I'm still
> > investigating why that is.

--snip--

> 4. For learning/exploration only, I rebased my experimental vectored
> FlushBuffers() patch, which teaches the checkpointer to write relation
> data out using smgrwritev(). The checkpointer explicitly sorts
> blocks, but I think ring buffers should naturally often contain
> consecutive blocks in ring order. Highly experimental POC code pushed
> to a public branch[2], but I am not proposing anything here, just
> trying to understand things. The nicest looking system call trace was
> with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads
> and WAL writes 128kB at a time:
>
> pwrite(32,...,131072,0xfc6000) = 131072 (0x20000)
> fdatasync(32) = 0 (0x0)
> pwrite(27,...,131072,0x6c0000) = 131072 (0x20000)
> pread(27,...,131072,0x73e000) = 131072 (0x20000)
> pwrite(27,...,131072,0x6e0000) = 131072 (0x20000)
> pread(27,...,131072,0x75e000) = 131072 (0x20000)
> pwritev(27,[...],3,0x77e000) = 131072 (0x20000)
> preadv(27,[...],3,0x77e000) = 131072 (0x20000)
>
> That was a fun experiment, but... I recognise that efficient cleaning
> of ring buffers is a Hard Problem requiring more concurrency: it's
> just too late to be flushing that WAL. But we also don't want to
> start writing back data immediately after dirtying pages (cf. OS
> write-behind for big sequential writes in traditional Unixes), because
> we're not allowed to write data out without writing the WAL first and
> we currently need to build up bigger WAL writes to do so efficiently
> (cf. some other systems that can write out fragments of WAL
> concurrently so the latency-vs-throughput trade-off doesn't have to be
> so extreme). So we want to defer writing it, but not too long. We
> need something cleaning our buffers (or at least flushing the
> associated WAL, but preferably also writing the data) not too late and
> not too early, and more in sync with our scan than the WAL writer is.
> What that machinery should look like I don't know (but I believe
> Andres has ideas).
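
For reference, the sizes and offsets in the quoted trace can be decoded with a little arithmetic. This is only a minimal sketch, assuming the default 8kB block size (BLCKSZ); the helper names are made up for illustration and are not PostgreSQL functions.

```python
# Assumption: default PostgreSQL block size of 8kB. A build with a
# different BLCKSZ would change every number computed below.
BLCKSZ = 8192


def blocks_per_io(nbytes, blcksz=BLCKSZ):
    """Whole blocks covered by one I/O of nbytes (hypothetical helper)."""
    return nbytes // blcksz


def offset_to_block(offset, blcksz=BLCKSZ):
    """Block number containing a byte offset in a file (hypothetical helper)."""
    return offset // blcksz


io_size = 131072          # size of each pwrite/pread in the trace
ring_size = 512 * 1024    # BUFFER_USAGE_LIMIT of 512kB, in bytes

# Blocks covered by one 128kB I/O.
print(blocks_per_io(io_size))
# Number of 128kB I/Os that fit in the 512kB ring.
print(ring_size // io_size)
# Distance in blocks between the first write on fd 27 and the read after it.
print(offset_to_block(0x73e000) - offset_to_block(0x6c0000))
```

Under that assumption, each I/O covers 16 blocks, the ring holds four such I/Os, and in the first two write/read pairs on fd 27 the read lands 63 blocks ahead of the write.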

I've attached a WIP v11 streaming vacuum patch set here that is
rebased over master (by Thomas), so that I could add a CF entry for
it. It still has the problem with the extra WAL write and fsync calls
investigated by Thomas above. Thomas has some work in progress doing
streaming write-behind to alleviate the issues with the buffer access
strategy and streaming reads. When he gets a version of that ready to
share, he will start a new "Streaming Vacuum" thread.

- Melanie

Attachment                                              Content-Type  Size
v11-0002-Refactor-tidstore.c-memory-management.patch    text/x-patch  8.5 KB
v11-0001-Use-streaming-I-O-in-VACUUM-first-pass.patch   text/x-patch  8.3 KB
v11-0003-Use-streaming-I-O-in-VACUUM-second-pass.patch  text/x-patch  3.5 KB
