Re: AIO v2.0

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2024-09-16 14:43:49
Message-ID: 20240916144349.74.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

> +typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md. Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

> +{
> + /* not in use */
> + AHS_IDLE = 0,
> +
> + /* returned by pgaio_io_get() */
> + AHS_HANDED_OUT,
> +
> + /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> + AHS_DEFINED,
> +
> + /* subjects prepare() callback has been called */
> + AHS_PREPARED,
> +
> + /* IO is being executed */
> + AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

> +
> + /* IO finished, but result has not yet been processed */
> + AHS_REAPED,
> +
> + /* IO completed, shared completion has been called */
> + AHS_COMPLETED_SHARED,
> +
> + /* IO completed, local completion has been called */
> + AHS_COMPLETED_LOCAL,
> +} PgAioHandleState;

> +void
> +pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
> +{
> + PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
> +
> + Assert(ioh->resowner);
> +
> + ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
> + ioh->resowner = NULL;
> +
> + switch (ioh->state)
> + {
> + case AHS_IDLE:
> + elog(ERROR, "unexpected");
> + break;
> + case AHS_HANDED_OUT:
> + Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
> +
> + if (ioh == my_aio->handed_out_io)
> + {
> + my_aio->handed_out_io = NULL;
> + if (!on_error)
> + elog(WARNING, "leaked AIO handle");
> + }
> +
> + pgaio_io_reclaim(ioh);
> + break;
> + case AHS_DEFINED:
> + case AHS_PREPARED:
> + /* XXX: Should we warn about this when is_commit? */

Yes.

> + pgaio_submit_staged();
> + break;
> + case AHS_IN_FLIGHT:
> + case AHS_REAPED:
> + case AHS_COMPLETED_SHARED:
> + /* this is expected to happen */
> + break;
> + case AHS_COMPLETED_LOCAL:
> + /* XXX: unclear if this ought to be possible? */
> + pgaio_io_reclaim(ioh);
> + break;
> + }

> +void
> +pgaio_io_ref_wait(PgAioHandleRef *ior)
> +{
> + uint64 ref_generation;
> + PgAioHandleState state;
> + bool am_owner;
> + PgAioHandle *ioh;
> +
> + ioh = pgaio_io_from_ref(ior, &ref_generation);
> +
> + am_owner = ioh->owner_procno == MyProcNumber;
> +
> +
> + if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> + return;
> +
> + if (am_owner)
> + {
> + if (state == AHS_DEFINED || state == AHS_PREPARED)
> + {
> + /* XXX: Arguably this should be prevented by callers? */
> + pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup. Even so,
the next point might remove the need here:

> +void
> +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> +{
> + Assert(ioh->state == AHS_HANDED_OUT);
> + Assert(pgaio_io_has_subject(ioh));
> +
> + ioh->op = op;
> + ioh->state = AHS_DEFINED;
> + ioh->result = 0;
> +
> + /* allow a new IO to be staged */
> + my_aio->handed_out_io = NULL;
> +
> + pgaio_io_prepare_subject(ioh);
> +
> + ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

> +if test "$with_liburing" = yes; then
> + PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

I used the attached makefile patch to build w/ liburing.

> +pgaio_uring_shmem_init(bool first_time)
> +{
> + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
> + bool found;
> +
> + aio_uring_contexts = (PgAioUringContext *)
> + ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
> +
> + if (found)
> + return;
> +
> + for (int contextno = 0; contextno < TotalProcs; contextno++)
> + {
> + PgAioUringContext *context = &aio_uring_contexts[contextno];
> + int ret;
> +
> + /*
> + * XXX: Probably worth sharing the WQ between the different rings,
> + * when supported by the kernel. Could also cause additional
> + * contention, I guess?
> + */
> +#if 0
> + if (!AcquireExternalFD())
> + elog(ERROR, "No external FD available");
> +#endif
> + ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

> +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> +{
> + struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
> +
> + Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
> +
> + for (int i = 0; i < num_staged_ios; i++)
> + {
> + PgAioHandle *ioh = staged_ios[i];
> + struct io_uring_sqe *sqe;
> +
> + sqe = io_uring_get_sqe(uring_instance);
> +
> + pgaio_io_prepare_submit(ioh);
> + pgaio_uring_sq_from_io(ioh, sqe);
> + }
> +
> + while (true)
> + {
> + int ret;
> +
> + pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
> + ret = io_uring_submit(uring_instance);
> + pgstat_report_wait_end();
> +
> + if (ret == -EINTR)
> + {
> + elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> + continue;
> + }

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

> +FileStartWriteV(struct PgAioHandle *ioh, File file,
> + int iovcnt, off_t offset,
> + uint32 wait_event_info)
> +{
> + int returnCode;
> + Vfd *vfdP;
> +
> + Assert(FileIsValid(file));
> +
> + DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
> + file, VfdCache[file].fileName,
> + (int64) offset,
> + iovcnt));
> +
> + returnCode = FileAccess(file);
> + if (returnCode < 0)
> + return returnCode;
> +
> + vfdP = &VfdCache[file];
> +
> + /* FIXME: think about / reimplement temp_file_limit */
> +
> + pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
> +
> + return 0;
> +}

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them. Otherwise,
deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically. Could you add a mention of "deadlock" in the
comment at whichever code achieves that?

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations. Is there any
specific decision you'd like to settle before patch 6 exits WIP?

Thanks,
nm

Attachment Content-Type Size
uring-makefile-v1.patch text/plain 842 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-09-16 15:19:42 Re: Regression tests fail with tzdata 2024b
Previous Message Bertrand Drouvot 2024-09-16 14:33:20 Re: Add contrib/pg_logicalsnapinspect