WIP: Vectored writeback

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: Vectored writeback
Date: 2024-04-04 06:29:44
Message-ID: CA+hUKGK1in4FiWtisXZ+Jo-cNSbWjmBcPww3w3DBM+whJTABXA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Here are some vectored writeback patches I worked on in the 17 cycle
and posted as part of various patch sets, but didn't get into a good
enough shape to take further. They "push" vectored writes out, but I
think what they need is to be turned inside out and converted into
users of a new hypothetical write_stream.c, so that we have a model
that will survive contact with asynchronous I/O and would "pull"
writes from a stream that controls I/O concurrency. That all seemed a
lot less urgent to work on than reads, hence leaving them on ice for now.
There is a lot of code that reads, and only a small, finite amount that
writes. I think the patches show some aspects of the problem space
though, and they certainly make checkpointing faster. They cover 2
out of 5ish ways we write relation data: checkpointing, and strategies
AKA ring buffers.

They make checkpoints look like this, respecting io_combine_limit,
instead of lots of 8kB writes:

pwritev(9,[...],2,0x0) = 131072 (0x20000)
pwrite(9,...,131072,0x20000) = 131072 (0x20000)
pwrite(9,...,131072,0x40000) = 131072 (0x20000)
pwrite(9,...,131072,0x60000) = 131072 (0x20000)
pwrite(9,...,131072,0x80000) = 131072 (0x20000)
...
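
For illustration only, here is a minimal standalone sketch of the kind of
write combining that produces a trace like the one above. It is not code
from the attached patches: run_length(), write_combined() and
COMBINE_LIMIT are made-up names, COMBINE_LIMIT merely stands in for
io_combine_limit, and the input is assumed to be already sorted by block
number with each 8kB page in its own memory buffer.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE			/* exposes pwritev() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */
#define COMBINE_LIMIT 16		/* hypothetical stand-in for io_combine_limit */

/*
 * Count how many entries starting at blocks[start] have consecutive block
 * numbers, capped at COMBINE_LIMIT.
 */
static size_t
run_length(const uint32_t *blocks, size_t nblocks, size_t start)
{
	size_t		n = 1;

	assert(start < nblocks);
	while (start + n < nblocks &&
		   n < COMBINE_LIMIT &&
		   blocks[start + n] == blocks[start] + n)
		n++;
	return n;
}

/*
 * Write pages (sorted by block number) to fd with one pwritev() per run of
 * consecutive blocks.  Returns the number of system calls issued, or -1 on
 * error or short write (a real implementation would retry short writes).
 */
static int
write_combined(int fd, const uint32_t *blocks, char *const *pages,
			   size_t nblocks)
{
	int			ncalls = 0;

	for (size_t i = 0; i < nblocks;)
	{
		struct iovec iov[COMBINE_LIMIT];
		size_t		n = run_length(blocks, nblocks, i);
		off_t		offset = (off_t) blocks[i] * BLCKSZ;

		for (size_t j = 0; j < n; j++)
		{
			iov[j].iov_base = pages[i + j];
			iov[j].iov_len = BLCKSZ;
		}
		if (pwritev(fd, iov, (int) n, offset) != (ssize_t) (n * BLCKSZ))
			return -1;
		ncalls++;
		i += n;
	}
	return ncalls;
}
```

With blocks {0, 1, 2, 5, 6, 9} this would issue three pwritev() calls
covering 3, 2 and 1 blocks. A cap of 16 blocks of 8kB gives the
131072-byte writes seen in the trace.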

Two more ways data gets written back are bgwriter and regular
BAS_NORMAL buffer eviction, but they are not such natural candidates
for write combining. Well, if you know you're going to write out a
buffer, *maybe* it's worth probing the buffer pool to see if the
adjacent block numbers, before and after, are also present and dirty?
I don't know. Or maybe it's better to wait for the tree-based mapping
table of legend first, so it becomes cheaper to navigate in block
number order.
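
To make the probing idea concrete, here is a toy sketch. Everything in it
is hypothetical: ToyPool is just an array of dirty flags indexed by block
number, standing in for the real buffer mapping table, and MAX_COMBINE is
an arbitrary cap. In the real buffer pool each probe would be a separate
mapping-table lookup, which is where the cost question comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_COMBINE 16u			/* hypothetical cap on the combined range */

/* Toy stand-in for the buffer pool: dirty[b] is true if block b is dirty. */
typedef struct ToyPool
{
	const bool *dirty;
	uint32_t	nblocks;
} ToyPool;

/* One "probe": is this block present and dirty? */
static bool
toy_is_dirty(const ToyPool *pool, uint32_t blockno)
{
	return blockno < pool->nblocks && pool->dirty[blockno];
}

/*
 * Given a dirty block we have decided to write, widen the range
 * [*first, *last] over dirty neighbors, looking after it and then before
 * it, until a clean or absent block or the cap is reached.
 */
static void
probe_neighbors(const ToyPool *pool, uint32_t victim,
				uint32_t *first, uint32_t *last)
{
	uint32_t	lo = victim;
	uint32_t	hi = victim;

	assert(toy_is_dirty(pool, victim));
	while (hi - lo + 1 < MAX_COMBINE && toy_is_dirty(pool, hi + 1))
		hi++;
	while (hi - lo + 1 < MAX_COMBINE && lo > 0 && toy_is_dirty(pool, lo - 1))
		lo--;
	*first = lo;
	*last = hi;
}
```

With blocks 1, 2, 3 and 5 dirty, evicting block 2 would yield the range
1..3, while evicting block 5 would find nothing to combine with. Every
eviction pays for at least two extra probes even when they find nothing.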

The 5th way is raw file copy that doesn't go through the buffer pool,
such as CREATE DATABASE ... STRATEGY=FILE_COPY, which already works
with big writes, and CREATE INDEX via bulk_write.c which is easily
converted to vectored writes, and I plan to push the patches for that
shortly. I think those should ultimately become stream-based too.

Anyway, I wanted to share these uncommitfest patches, having rebased
them over relevant recent commits, so I could leave them in working
state in case anyone is interested in this file I/O-level stuff...

Attachment                                              Content-Type  Size
v5-0001-Provide-vectored-variant-of-FlushBuffer.patch   text/x-patch  13.1 KB
v5-0002-Use-vectored-writes-in-checkpointer.patch       text/x-patch  10.7 KB
v5-0003-Vectored-ring-buffer-writes.patch               text/x-patch  5.6 KB
