Re: AIO v2.5

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Subject: Re: AIO v2.5
Date: 2025-03-11 19:41:08
Message-ID: 20250311194108.c5.nmisch@google.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > For non-sync IO methods, I gather it's essential that a process other than the
> > IO definer be scanning for incomplete IOs and completing them.

> > Otherwise, deadlocks like this would happen:
>
> > backend1 locks blk1 for non-IO reasons
> > backend2 locks blk2, starts AIO write
> > backend1 waits for lock on blk2 for non-IO reasons
> > backend2 waits for lock on blk1 for non-IO reasons
> >
> > If that's right, in worker mode, the IO worker resolves that deadlock. What
> > resolves it under io_uring? Another process that happens to do
> > pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> > make that happen systematically.
>
> Yea, it's code that I haven't forward ported yet. I think basically
> LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
> immediately acquire the lock and if the buffer has IO going on.

I'm not finding that code in v2.6. What function has it?

[I wrote a bunch of the subsequent comments against v2.5. I may have missed
instances of v2.6 obsoleting them.]

On Tue, Mar 04, 2025 at 02:00:14PM -0500, Andres Freund wrote:
> Attached is v2.5 of the AIO patchset.

> - Added a proper commit message fo the main commit. I'd appreciate folks
> reading through it. I'm sure I forgot a lot of folks and a lot of things.

Commit message looks fine.

> At this point I am not aware of anything significant left to do in the main
> AIO commit, safe some of the questions below.

That is a big milestone.

> Questions:
>
> - My current thinking is that we'd set io_method = worker initially - so we
> actually get some coverage - and then decide whether to switch to
> io_method=sync by default for 18 sometime around beta1/2. Does that sound
> reasonable?

Yes.

> - We could reduce memory usage a tiny bit if we made the mapping between
> pgproc and per-backend-aio-state more complicated, i.e. not just indexed by
> ProcNumber. Right now IO workers have the per-backend AIO state, but don't
> actually need it. I'm mildly inclined to think that the complexity isn't
> worth it, but on the fence.

The max memory savings, for 32 IO workers, is like the difference between
max_connections=500 and max_connections=532, right? If that's right, I
wouldn't bother in the foreseeable future.

> - Three of the commits in the series really are just precursor commits to
> their subsequent commits, which I found helpful for development and review,
> namely:
>
> - aio: Basic subsystem initialization
> - aio: Skeleton IO worker infrastructure
> - aio: Add liburing dependency
>
> Not sure if it's worth keeping these separate or whether they should just be
> merged with their "real commit".

The split aided my review. It's trivial to turn an unmerged stack of commits
into the merged equivalent, but unmerging is hard.

> - Thomas suggested renaming
> COMPLETED_IO->COMPLETED,
> COMPLETED_SHARED->TERMINATED_BY_COMPLETER,
> COMPLETED_SHARED->TERMINATED_BY_SUBMITTER
> in
> https://www.postgresql.org/message-id/CA%2BhUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw%40mail.gmail.com
>
> While the other things in the email were commented upon by others and
> addressed in v2.4, the naming aspect wasn't further remarked upon by others.
> I'm not personally in love with the suggested names, but I could live with
> them.

I, too, could live with those. None of these naming proposals bother me, and
I would not have raised the topic myself. If I were changing it further, I'd
use these principles:

- use COMPLETED or TERMINATED, not both
- I like COMPLETED, because _complete_ works well in a function name.
_terminate_ sounds more like an abnormal interruption.
- If one state name lacks a suffix, it should be the final state.

So probably one of:

{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{KERN,RETURN,RETVAL,ERRNO}
{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}_{SHMEM,SHARED}
{COMPLETED,TERMINATED,FINISHED,REAPED,DONE}{_SUBMITTER,}

If it were me picking today, I'd pick:

COMPLETED_RETURN
COMPLETED_SHMEM
COMPLETED

> - Right now this series defines PGAIO_VERBOSE to 1. That's good for debugging,
> but all the ereport()s add a noticeable amount of overhead at high IO
> throughput (at multiple gigabytes/second), so that's probably not right
> forever. I'd leave this on initially and then change it to default to off
> later. I think that's ok?

Sure. Perhaps make it depend on USE_ASSERT_CHECKING later?

> - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
> automatically in/decrease active workers, the max number of workers (32) is
> always allocated. That means we use more semaphores than before. I think
> that's ok, it's not 1995 anymore. Alternatively we can add a
> "io_workers_max" GUC and probe for it in initdb.

Let's start as you have it. If someone wants to make things perfect for
non-root BSD users, they can add the GUC later. io_method=sync is a
sufficient backup plan indefinitely.

> - pg_stat_aios currently has the IO Handle flags as dedicated columns. Not
> sure that's great?
>
> They could be an enum array or such too? That'd perhaps be a bit more
> extensible? OTOH, we don't currently use enums in the catalogs and arrays
> are somewhat annoying to conjure up from C.

An enum array does seem elegant and extensible, but it has the problems you
say. (I would expect to lose time setting up pg_enum.oid values to not change
between releases.) A possible compromise would be a text array like
heap_tuple_infomask_flags() does. Overall, I'm not seeing a clear need to
change away from the bool columns.

> Todo:

> - Figure out how to deduplicate support for LockBufferForCleanup() in
> TerminateBufferIO().

Yes, I agree there's an opportunity for a WakePinCountWaiter() or similar
subroutine.

> - Check if documentation for track_io_timing needs to be adjusted, after the
> bufmgr.c changes we only track waiting for an IO.

Yes.

On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> Attached is v2.6 of the AIO patchset.

> - 0005, 0006 - io_uring support - close, but we need to do something about
> set_max_fds(), which errors out spuriously in some cases

What do we know about those cases? I don't see a set_max_fds(); is that
set_max_safe_fds(), or something else?

> - 0025 - tests for AIO - I think it's reasonable, unless somebody objects to
> exporting a few bufmgr.c functions to the test

I'll essentially never object to that.

> + * AIO handles need be registered in critical sections and therefore
> + * cannot use the normal ResoureElem mechanism.

s/ResoureElem/ResourceElem/

> + <varlistentry id="guc-io-method" xreflabel="io_method">
> + <term><varname>io_method</varname> (<type>enum</type>)
> + <indexterm>
> + <primary><varname>io_method</varname> configuration parameter</primary>
> + </indexterm>
> + </term>
> + <listitem>
> + <para>
> + Selects the method for executing asynchronous I/O.
> + Possible values are:
> + <itemizedlist>
> + <listitem>
> + <para>
> + <literal>sync</literal> (execute asynchronous I/O synchronously)

The part in parentheses reads like a contradiction to me. How about phrasing
it like one of these:

(execute I/O synchronously, even I/O eligible for asynchronous execution)
(execute asynchronous-eligible I/O synchronously)
(execute I/O synchronously, even when asynchronous execution was feasible)

> + * This could be in aio_internal.h, as it is not pubicly referenced, but

typo -> publicly

> + * On what is IO being performed.

End sentence with question mark, probably.

> + * List of in-flight IOs. Also contains IOs that aren't strict speaking

s/strict/strictly/

> + /*
> + * Start executing passed in IOs.
> + *
> + * Will not be called if ->needs_synchronous_execution() returned true.
> + *
> + * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
> + *

I recommend adding "Always called in a critical section." since at least
pgaio_worker_submit() subtly needs it.

> + */
> + int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);

> + * Each backend can only have one AIO handle that that has been "handed out"

s/that that/that/

> + * AIO, it typically will pass the handle to smgr., which will pass it on to

s/smgr.,/smgr.c,/ or just "smgr"

> +PgAioHandle *
> +pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
> +{
> + if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
> + {
> + Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
> + pgaio_submit_staged();

I'm seeing the "num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE" case uncovered in a
check-world coverage report. I tried PGAIO_SUBMIT_BATCH_SIZE=2,
io_max_concurrency=1, and io_max_concurrency=64. Do you already have a recipe
for reaching this case?

> +/*
> + * Stage IO for execution and, if necessary, submit it immediately.
> + *
> + * Should only be called from pgaio_io_prep_*().
> + */
> +void
> +pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
> +{

We've got closely-associated verbs "prepare", "prep", and "stage". README.md
doesn't mention "stage". Can one of the following two changes happen?

- README.md starts mentioning "stage" and how it differs from the others
- Code stops using "stage"

> + * locallbacks just before reclaiming at multiple callsites.

s/locallbacks/local callbacks/

> + * Check if the the referenced IO completed, without blocking.

s/the the/the/

> + * Batch submission mode needs to explicitly ended with
> + * pgaio_exit_batchmode(), but it is allowed to throw errors, in which case
> + * error recovery will end the batch.

This sentence needs some grammar help, I think. Maybe use:

* End batch submission mode with pgaio_exit_batchmode(). (Throwing errors is
* allowed; error recovery will end the batch.)

> Size
> AioShmemSize(void)
> {
> Size sz = 0;
>
> + /*
> + * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
> + * However, if the DBA explicitly set wal_buffers = -1 in the config file,

s/wal_buffers/io_max_concurrency/

> +extern int io_workers;

By the rule that GUC vars are PGDLLIMPORT, this should be PGDLLIMPORT.

> +static void
> +maybe_adjust_io_workers(void)

This also restarts workers that exit, so perhaps name it
start_io_workers_if_missing().

> +{
...
> + /* Try to launch one. */
> + child = StartChildProcess(B_IO_WORKER);
> + if (child != NULL)
> + {
> + io_worker_children[id] = child;
> + ++io_worker_count;
> + }
> + else
> + break; /* XXX try again soon? */

Can LaunchMissingBackgroundProcesses() become the sole caller of this
function, replacing the current mix of callers? That would be more conducive
to promptly doing the right thing after launch failure.

> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
> case B_CHECKPOINTER:
> backendDesc = gettext_noop("checkpointer");
> break;
> + case B_IO_WORKER:
> + backendDesc = "io worker";

Wrap in gettext_noop() like B_CHECKPOINTER does.

> + Only has an effect if <xref linkend="guc-max-wal-senders"/> is set to
> + <literal>worker</literal>.

s/guc-max-wal-senders/guc-io-method/

> + * of IOs, wakeups "fan out"; each woken IO worker can wake two more. qXXX

s/qXXX/XXX/

> + /*
> + * It's very unlikely, but possible, that reopen fails. E.g. due
> + * to memory allocations failing or file permissions changing or
> + * such. In that case we need to fail the IO.
> + *
> + * There's not really a good errno we can report here.
> + */
> + error_errno = ENOENT;

Agreed there's not a good errno, but let's use a fake errno that we're mighty
unlikely to confuse with an actual case of libc returning that errno. Like
one of EBADF or EOWNERDEAD.

> + for (int contextno = 0; contextno < TotalProcs; contextno++)
> + {
> + PgAioUringContext *context = &pgaio_uring_contexts[contextno];
> + int ret;
> +
> + /*
> + * XXX: Probably worth sharing the WQ between the different rings,
> + * when supported by the kernel. Could also cause additional
> + * contention, I guess?
> + */
> +#if 0
> + if (!AcquireExternalFD())
> + elog(ERROR, "No external FD available");
> +#endif

Probably remove the "#if 0" or add a comment on why it's here.

> + ret = io_uring_submit(uring_instance);
> + pgstat_report_wait_end();
> +
> + if (ret == -EINTR)
> + {
> + pgaio_debug(DEBUG3,
> + "aio method uring: submit EINTR, nios: %d",
> + num_staged_ios);
> + }
> + else if (ret < 0)
> + elog(PANIC, "failed: %d/%s",
> + ret, strerror(-ret));

I still think (see 2024-09-16 review) EAGAIN should do the documented
recommendation instead of PANIC:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait for
some completions and try again.

At a minimum, it deserves a comment like "We accept PANIC on memory exhaustion
here."

> + pgstat_report_wait_end();
> +
> + if (ret == -EINTR)
> + {
> + continue;
> + }
> + else if (ret != 0)
> + {
> + elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));

I think errno isn't meaningful here, so %m doesn't belong.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2687,6 +2687,12 @@ include_dir 'conf.d'
> <literal>worker</literal> (execute asynchronous I/O using worker processes)
> </para>
> </listitem>
> + <listitem>
> + <para>
> + <literal>io_uring</literal> (execute asynchronous I/O using
> + io_uring, if available)
> + </para>
> + </listitem>

Docs should eventually cover RLIMIT_MEMLOCK per
https://github.com/axboe/liburing "ulimit settings". Maybe RLIMIT_NOFILE,
too.

> @@ -2498,6 +2529,12 @@ FilePathName(File file)
> int
> FileGetRawDesc(File file)
> {
> + int returnCode;
> +
> + returnCode = FileAccess(file);
> + if (returnCode < 0)
> + return returnCode;
> +
> Assert(FileIsValid(file));
> return VfdCache[file].fd;
> }

What's the rationale for this function's change?

> +The main reason to want to use Direct IO are:

> +The main reason *not* to use Direct IO are:

x2 s/main reason/main reasons/

> + and direct IO without O_DSYNC needs to issue a write and after the writes
> + completion a cache cache flush, whereas O\_DIRECT + O\_DSYNC can use a

s/writes/write's/

> + single FUA write).

I recommend including the acronym expansion: s/FUA/Force Unit Access (FUA)/

> +In an `EXEC_BACKEND` build backends executable code and other process local

s/backends/backends'/

> +state is not necessarily mapped to the same addresses in each process due to
> +ASLR. This means that the shared memory cannot contain pointer to callbacks.

s/pointer/pointers/

> +The "solution" to this the ability to associate multiple completion callbacks
> +with a handle. E.g. bufmgr.c can have a callback to update the BufferDesc
> +state and to verify the page and md.c. another callback to check if the IO
> +operation was successful.

One of these or similar:
s/md.c. another/md.c can have another/
s/md.c. /md.c /

I've got one high-level question that I felt could take too long to answer for
myself by code reading. What's the cleanup story if process A does
elog(FATAL) with unfinished I/O? Specifically:

- Suppose some other process B reuses the shared memory AIO data structures
that pertained to process A. After that, some process C completes the I/O
in shmem. Do we avoid confusing B by storing local callback data meant for
A in shared memory now pertaining to B?

- Thinking more about this README paragraph:

+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to increase buffer reference counts to account for the
+AIO subsystem referencing the buffer, which is required to handle the case
+where the issuing backend errors out and releases its own pins while the IO is
+still ongoing.

Which function performs that reference count increase? I'm not finding it
today. I wanted to look at how it ensures the issuing backend still exists
as the function increases the reference count.

One later-patch item:

> +static PgAioResult
> +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed)
> +{
...
> + TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum,
> + tag.blockNum,
> + tag.spcOid,
> + tag.dbOid,
> + tag.relNumber,
> + INVALID_PROC_NUMBER,
> + false);

I wondered about whether the buffer-read-done probe should happen in the
process that calls the complete_shared callback or in the process that did the
buffer-read-start probe. When I see dtrace examples, they usually involve
explicitly naming each PID to trace. Assuming that's indeed the norm, I think
the local callback would be the better place, so a given trace contains both
probes. If it were reasonable to dtrace all current and future postmaster
kids, that would argue for putting the probe in the complete_shared callback.
Alternatively, one could argue for separate probes buffer-read-done-shmem
and buffer-read-done.

Thanks,
nm
