From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
Subject: | Re: AIO v2.3 |
Date: | 2025-01-23 04:29:02 |
Message-ID: | p5flybfcjrzgpgwcnzkxsp3xdw3czfoafkvks6g7en6ewmvymh@wrz7z2vwwc7c |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
Attached is v2.3.
There are a lot of changes - primarily renaming things based on on-list and
off-list feedback. But also some other things
Functional:
- Added pg_aios view
- md.c registering sync requests, that was previously omitted
- This triggered stats issues during shutdown, as it can lead to IO workers
emitting stats in some corner cases. I've written a patch series to
address this [1]. For now I've included them in this patchset, but I would
like to push the reordering patches soon.
- Testing error handling for temp table IO made me realize that the previous
pattern of just tracking the refcount held by the IO subsystem in the
LocalRefCount array leads to spurious buffer leak warnings [2]. I attached
a prototype patch to deal with this by bringing localbuf.c more in line with
bufmgr.c, but it needs some cleanup.
That's in v2.3-0020-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch
- Wait for all IOs to finish during shutdown. This is primarily required to
ensure there aren't IOs initiated by a prior "owner" of a ProcNumber when a
new backend starts. But there are also some kernels that don't like when
exiting while IO is in flight.
- Re-armed local completion callbacks, they're required for correctness of
temporary table IO
- Added a bunch of central debug helpers that only lead to output if
PGAIO_VERBOSE is defined. That did make code a good bit more readable.
Polishing:
- Lots of copy editing, a lot of it thanks to feedback by Noah and Heikki
- Renamed the previous concept of a "subject" of an IO (i.e. what the IO is
executed on, an smgr relation, a WAL file, ...) to "target". I'm not in
love with that name, but I went through dozens of variations, and it does
seem better than subject.
Not sure anymore how I ended up with subject, it's grammatically off and not
very descriptive to boot.
- Renamed "PgAioHandleRef" and related functions to
PgAioWaitRef/pgaio_wref_*(), that seems a lot more descriptive.
- Renamed pgaio_io_get() to pgaio_io_acquire()
- Renamed the IO handle states (PREPARED to STAGED, IN_FLIGHT to SUBMITTED,
REAPED to COMPLETED_IO).
Particularly the various COMPLETED state names aren't necessarily final,
I've been debating a bunch of variations with Thomas and Robert
- Renamed aio_ref.h to aio_types.h, moved a few more types into it.
- Renamed completion callbacks to not use "shared" anymore - ->prepare was not
really shared and now local callbacks are back (in a restricted form).
s/PgAioHandleSharedCallback/PgAioHandleCallback/
s/pgaio_io_add_shared_cb/pgaio_io_register_callbacks/
Not entirely sure *register_callbacks is the best, happy to adjust.
- Renamed the ->error IO handle callback to ->report
Also renamed s/pgaio_result_log/pgaio_result_report/g
- Renamed the ->prepare IO handle callback to ->stage
Also renamed s/pgaio_result_log/pgaio_result_report/g
- Partially addressed request to reorder aio/README.md
- Determine shared memory allocation size with PG_IOV_MAX not io_combine_limit
io_combine_limit is USERSET, so it's not correct to use it for shmem
allocations. I chose PG_IOV_MAX instead of MAX_IO_COMBINE_LIMIT because this
is a more generic limit than bufmgr.c IO.
- Prefix PgAio* enums with PGAIO_, global variables with pgaio_*
- Split out callback related cod from aio_subject.c (now aio_target.c) into
aio_callback.c. The target specific code is rather small, so this makes a
lot more sense.
- Distributed functions into more appropriate .c files, documented the choice
in aio.h, reorder them
- Disowned lwlock: More consistent naming, reduce diff size, resume interrupts
Heikki asked to clear ->owner when disowning the lock - but we currently
*never* clear it doesn't seem right to do so when disowning the lock.
- IO data that can be set on a handle (to e.g. transport an array of Buffers
to the completion callbacks) is now done with
pgaio_io_(get|set)_handle_data(). Mainly to distinguish it from data that's
actually the target/source of a read/write.
Heikki suggested to make this per-callback data, but I don't think there's
currently a use case for that, and it'd add a fair bit of memory overhead. I
added a comment documenting this.
- Lots of other cleanups, added comments and the like
Todo:
- Reorder README further
- Make per backend state not indexed by ProcNumber, as that requires reserving
per-backend state for IO workers, which will never need them
- Clean up localbuf.c "preparation" patches
- Add more tests - I had hoped to get to this, but got sidetracked with a
bunch of things I found while testing
- I started looking into having a distinct type for the public pgaio_io_*
related functions that can be used just by the issuer of the IO. It does
make things a bit easier to understand, but also complicates naming. Not
sure if it's worth it yet.
- Need to define (and test) the behavior when an IO worker fails to reopen the
file for an IO
- Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
such. I don't love that name though.
- There's some duplicated code in aio_callback.c, it'd be nice to deduplicate
the callback invaction of the different callbacks
- Local callbacks are triggered from within pgaio_io_reclaim(), that's not
exactly pretty. But it's currently the most central place to deal with the
case of IOs for which the shared completion callback was called in another
backend.
- As Jakub suggested (below [3]), when io_method=io_uring is used, we can run
of out file descriptors much more easily. At the very least we need a good
error message, perhaps also some rlimit adjusting (probably as a second
step, if so).
- Thomas is working on the read_stream.c <-> bufmgr.c integration piece
- Start to write docs adjustments
[1] https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
[2] https://postgr.es/m/j6hny5ivrfqw356ugoy3ti5ccadamluekxod4k6amao5snew6c%40t5h3bwhrgfqx
[3] https://postgr.es/m/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl%40fvmdr3c6uhat
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Smith | 2025-01-23 04:55:12 | Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility. |
Previous Message | vignesh C | 2025-01-23 04:28:09 | Re: Pgoutput not capturing the generated columns |