Re: AIO v2.2

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: AIO v2.2
Date: 2025-01-01 04:03:33
Message-ID: bgixmidc73doecg7wskq3k76g3nqnglqub7irbrwp4ppjsx43j@fwre2x775mcl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is a new version of the AIO patchset.

The biggest changes are:

- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. I'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.

- The read/write patches and the bounce buffer patches are split out, so that
there's no dependency between the first few AIO patches and the "don't dirty
while IO is going on" patcheset [1].

- Retries for partial IOs (i.e. short reads) are now implemented. Turned out
to take all of three lines and adding one missing variable initialization.

- I added quite a lot of function-header and file-header comments. There's
more to be done here, but see also the TODO section below.

- IO stats are now tracked. Specifically, the "time" for an IO is now the time
spent waiting for an IO, as discussed around [2]. I haven't updated the
docs yet.

- There now is a fastpath for executing AIO "synchronously", i.e. preparing an
IO and immediately submitting it.

- Previously one needed very large effective_io_concurrency values to get
sufficient asynchronous IO for sequential scans, as read_stream.c limited
max_pinned_buffers to effective_io_concurrency * 4. Unless
effective_io_concurrency was very high, that'd only allow a single IO to be
in-flight, due to io_combine_limit buffers getting merged into one IO.

Instead the pin limit is now capped by effective_io_concurrency *
io_combine_limit.

Right now that's part of one larger "hack up read_stream.c" commit, Thomas
said he'd take a look at how to do this properly. This is probably
something we could and should commit separately.

- io_method = sync has been made more similar to the way IO happens today. In
particular, we now continue to issue prefetch requests and the actual IO is
done only within WaitReadBuffers().

- When using buffered IO with io_uring, there previously was a small
regression, due to more IO happening in the process context with io_uring
(instead of in a kernel thread). While one could argue that it's better to
not increase CPU usage beyond one process, I don't find that sufficiently
convincing. To work around that I added a heuritic that tells IO uring to
execute IOs using it's worker infrastructure. That seems to have fixed this
problem entirely.

- IO worker infrastructure was cleaned up

- I pushed a few minor preliminary commits a while ago

- lots of other smaller stuff

The biggest TODOs are:

- Right now the API between bufmgr.c and read_stream.c kind of necessitates
that one StartReadBuffers() call actually can trigger multiple IOs, if
one of the buffers was read in by another backend, before "this" backend
called StartBufferIO().

I think Thomas and I figured out a way to evolve the interface so that this
isn't necessary anymore:

We allow StartReadBuffers() to memorize buffers it pinned but didn't
initiate IO on in the buffers[] argument. The next call to StartReadBuffers
then doesn't have to repin thse buffers. That doesn't just solve the
multiple-IOs for one "read operation" issue, it also make the - very common
- case of a bunch of "buffer misses" followed by a "buffer hit" cleaner, the
hit wouldn't be tracked in the same ReadBuffersOperation anymore.

- Right now bufmgr.h includes aio.h, because it needs to include a reference
to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
would be noticeable overhead, so that's not an option. I think the best
option here would be to introduce something like aio_types.h, so fewer
things are included.

- There's no obvious way to tell "internal" function operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.

One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.

This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.

The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).

- While I've added a lot of comments, I only got so far adding them. More are
needed.

- The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
improved

- The debug logging functions are a bit of a mess, lots of very similar code
in lots of places. I think AIO needs a few ereport() wrappers to make this
easier.

- More tests are needed. None of our current test frameworks really makes this
easy :(.

- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.

- I'm not sure that effective_io_concurrency as we have it right now really
makes sense, particularly not with the current default values. But that's a
mostly independent change.

Greetings,

Andres Freund

[1] https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[2] https://postgr.es/m/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl%40fvmdr3c6uhat

Attachment Content-Type Size
v2-0001-Ensure-a-resowner-exists-for-all-paths-that-may-p.patch text/x-diff 2.6 KB
v2-0002-Allow-lwlocks-to-be-unowned.patch text/x-diff 5.0 KB
v2-0003-aio-Basic-subsystem-initialization.patch text/x-diff 9.8 KB
v2-0004-aio-Core-AIO-implementation.patch text/x-diff 63.5 KB
v2-0005-aio-Skeleton-IO-worker-infrastructure.patch text/x-diff 22.2 KB
v2-0006-aio-Add-worker-method.patch text/x-diff 17.6 KB
v2-0007-aio-Add-liburing-dependency.patch text/x-diff 9.9 KB
v2-0008-aio-Add-io_uring-method.patch text/x-diff 14.1 KB
v2-0009-aio-Add-README.md-explaining-higher-level-design.patch text/x-diff 18.0 KB
v2-0010-aio-Implement-smgr-md.c-aio-methods.patch text/x-diff 25.6 KB
v2-0011-bufmgr-Implement-AIO-read-support.patch text/x-diff 19.2 KB
v2-0012-bufmgr-Use-aio-for-StartReadBuffers.patch text/x-diff 19.0 KB
v2-0013-aio-Very-WIP-read_stream.c-adjustments-for-real-A.patch text/x-diff 4.9 KB
v2-0014-aio-Add-bounce-buffers.patch text/x-diff 20.6 KB
v2-0015-bufmgr-Implement-AIO-write-support.patch text/x-diff 5.7 KB
v2-0016-aio-Add-IO-queue-helper.patch text/x-diff 7.2 KB
v2-0017-bufmgr-use-AIO-in-checkpointer-bgwriter.patch text/x-diff 31.6 KB
v2-0018-very-wip-test_aio-module.patch text/x-diff 42.2 KB
v2-0019-Temporary-Increase-BAS_BULKREAD-size.patch text/x-diff 1.3 KB
v2-0020-WIP-Use-MAP_POPULATE.patch text/x-diff 1.1 KB

In response to

  • AIO v2.0 at 2024-09-01 06:27:50 from Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2025-01-01 05:21:34 Re: Strange issue with NFS mounted PGDATA on ugreen NAS
Previous Message Tom Lane 2024-12-31 23:58:14 Re: Strange issue with NFS mounted PGDATA on ugreen NAS