AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Subject: AIO v2.5
Date: 2025-03-04 19:00:14
Message-ID: vz5i2x2wkjjpp6l2z4g4l3umpxktrxsejgdwefi7n6kzw666h3@rnhlvlpadig7
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is v2.5 of the AIO patchset.

Relative to 2.4 I:

- Committed some earlier commits. I ended up *not* committing the patch to
create resowners in more backends (e.g. walsender), as that's not really a
dependency for now.

One of the more important things to get committed was in a separate thread:
https://postgr.es/m/b6vveqz6r3wno66rho5lqi6z5kyhfgtvi3jcodyq5rlpp3cu44%40c6dsgf3z7yhs

Now relpath() can be used for logging while in a critical section. That
alone allowed to remove most of the remaining FIXMEs.

- Split md.c read/write patches, the write side is more complicated and isn't
needed before write support arrives (much later in the queue and very likely
not for 18).

The complicated bit about write support is needing to
register_dirty_segment() after completion of the write. If
RegisterSyncRequest() fails, the IO completer needs to open the file and
sync itself, unfortunately PathNameOpenFile() allocates memory, which isn't
ok while in a critical section (even though it'd not be detected, as it's
using malloc()).

- Reordered patches so that Thomas' read_stream work is after the basic AIO
infrastructure patches, there's no dependency to the earlier patches

I think Thomas might have a newer version of some of these, but since
they're not intended to be committed as part of this, I didn't spend the
time to rebase to the last version.

- Added a small bit of data that can be provided to callbacks, that makes it a
lot cleaner to transport information like ZERO_ON_ERROR.

I also did s/shared_callbacks/callbacks/, as the prior name was outdated.

- Substantially expanded tests, most importantly generic temp file tests and
AIO specific cross-backend tests

As part of the expanded tests I also needed to export TerminateBufferIO(),
like, as previously mentioned, already done in an earlier version for
StartBufferIO(). Nobody commented on that, so I think that's ok.

I also renamed the tests away from the very inventively named tbl_a, tbl_b...

- Moved the commit to create resownern in more places to much later in the
queue, it's not actually needed for bufmgr.c IO, and nothing needing it will
land in 18

- Added a proper commit message fo the main commit. I'd appreciate folks
reading through it. I'm sure I forgot a lot of folks and a lot of things.

- Did a fair bit of of comment polishing

- Addressed an XXX in the "aio infrastructure" commit suggesting that we might
want to error out if a backend is waiting on is own unsubmitted IO. Noah
argued for erroring out. I now made it so.

- Temporarily added a commit to increase open-file limit on openbsd. I saw
related errors without this patch too, but it fails more often with. I
already sent a separate email about this.

At this point I am not aware of anything significant left to do in the main
AIO commit, safe some of the questions below. There's a lot more potential
optimizations etc, but this is already a very complicated piece of work, so I
think they just has to wait for later.

There are a few things to clean up in the bufmgr.c commits, I don't yet quite
like the function naming and there could be a bit less duplication. But I
don't think that needs to be resolved before the main commit.

Questions:

- My current thinking is that we'd set io_method = worker initially - so we
actually get some coverage - and then decide whether to switch to
io_method=sync by default for 18 sometime around beta1/2. Does that sound
reasonable?

- We could reduce memory usage a tiny bit if we made the mapping between
pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
ProcNumber. Right now IO workers have the per-backend AIO state, but don't
actually need it. I'm mildly inclined to think that the complexity isn't
worth it, but on the fence.

- Three of the commits in the series really are just precursor commits to
their subsequent commits, which I found helpful for development and review,
namely:

- aio: Basic subsystem initialization
- aio: Skeleton IO worker infrastructure
- aio: Add liburing dependency

Not sure if it's worth keeping these separate or whether they should just be
merged with their "real commit".

- Thomas suggested renaming
COMPLETED_IO->COMPLETED,
COMPLETED_SHARED->TERMINATED_BY_COMPLETER,
COMPLETED_SHARED->TERMINATED_BY_SUBMITTER
in
https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com

While the other things in the email were commented upon by others and
addressed in v2.4, the naming aspect wasn't further remarked upon by others.
I'm not personally in love with the suggested names, but I could live with
them.

- Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
but all the ereport()s add a noticeable amount of overhead at high IO
throughput (at multiple gigabytes/second), so that's probably not right
forever. I'd leave this on initially and then change it to default to off
later. I think that's ok?

- To allow io_workers to be PGC_SIGHUP, and to eventually allow to
automatically in/decrease active workers, the max number of workers (32) is
always allocated. That means we use more semaphores than before. I think
that's ok, it's not 1995 anymore. Alternatively we can add a
"io_workers_max" GUC and probe for it in initdb.

- pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
sure that's great?

They could be an enum array or such too? That'd perhaps be a bit more
extensible? OTOH, we don't currently use enums in the catalogs and arrays
are somewhat annoying to conjure up from C.

Todo:

- A few more passes over the main commit, I'm sure there's a few more inartful
comments, odd formatting and such.

- Check if there's a decent way to deduplicate pgaio_io_call_complete_shared() and
pgaio_io_call_complete_local()

- Figure out how to deduplicate support for LockBufferForCleanup() in
TerminateBufferIO().

- Documentation for pg_stat_aios.

- Check if documentation for track_io_timing needs to be adjusted, after the
bufmgr.c changes we only track waiting for an IO.

- Some of the test_aio code is specific to non-temp tables, it probably is
worth generalizing to deal with temp tables and invoke them for both.

Greetings,

Andres

Attachment Content-Type Size
v2.5-0001-aio-Basic-subsystem-initialization.patch text/x-diff 16.7 KB
v2.5-0002-aio-Add-asynchronous-I-O-infrastructure.patch text/x-diff 93.4 KB
v2.5-0003-aio-Skeleton-IO-worker-infrastructure.patch text/x-diff 20.5 KB
v2.5-0004-aio-Add-worker-method.patch text/x-diff 21.2 KB
v2.5-0005-aio-Add-liburing-dependency.patch text/x-diff 13.2 KB
v2.5-0006-aio-Add-io_uring-method.patch text/x-diff 16.4 KB
v2.5-0007-aio-Add-README.md-explaining-higher-level-desig.patch text/x-diff 18.6 KB
v2.5-0008-aio-Implement-smgr-md-fd-read-support.patch text/x-diff 20.5 KB
v2.5-0009-aio-Add-pg_aios-view.patch text/x-diff 9.8 KB
v2.5-0010-Refactor-read_stream.c-s-circular-arithmetic.patch text/x-diff 4.6 KB
v2.5-0011-Improve-buffer-pool-API-for-per-backend-pin-lim.patch text/x-diff 6.4 KB
v2.5-0012-Respect-pin-limits-accurately-in-read_stream.c.patch text/x-diff 9.9 KB
v2.5-0013-Support-buffer-forwarding-in-read_stream.c.patch text/x-diff 8.9 KB
v2.5-0014-Support-buffer-forwarding-in-StartReadBuffers.patch text/x-diff 10.5 KB
v2.5-0015-WIP-tests-Expand-temp-table-tests-to-some-pin-r.patch text/x-diff 9.6 KB
v2.5-0016-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch text/x-diff 7.7 KB
v2.5-0017-bufmgr-Implement-AIO-read-support.patch text/x-diff 19.4 KB
v2.5-0018-bufmgr-Use-aio-for-StartReadBuffers.patch text/x-diff 18.8 KB
v2.5-0019-WIP-aio-read_stream.c-adjustments-for-real-AIO.patch text/x-diff 3.7 KB
v2.5-0020-aio-Add-test_aio-module.patch text/x-diff 41.7 KB
v2.5-0021-wip-ci-Increase-openbsd-kern.maxfiles-to-fix-co.patch text/x-diff 966 bytes
v2.5-0022-aio-Implement-smgr-md-fd-write-support.patch text/x-diff 12.8 KB
v2.5-0023-aio-Add-bounce-buffers.patch text/x-diff 22.8 KB
v2.5-0024-bufmgr-Implement-AIO-write-support.patch text/x-diff 6.2 KB
v2.5-0025-aio-Add-IO-queue-helper.patch text/x-diff 7.4 KB
v2.5-0026-bufmgr-use-AIO-in-checkpointer-bgwriter.patch text/x-diff 31.1 KB
v2.5-0027-Ensure-a-resowner-exists-for-all-paths-that-may.patch text/x-diff 2.6 KB
v2.5-0028-Temporary-Increase-BAS_BULKREAD-size.patch text/x-diff 1.3 KB
v2.5-0029-WIP-Use-MAP_POPULATE.patch text/x-diff 1.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2025-03-04 19:09:24 Re: scalability bottlenecks with (many) partitions (and more)
Previous Message Tomas Vondra 2025-03-04 18:58:38 Re: scalability bottlenecks with (many) partitions (and more)