From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-19 21:25:30
Message-ID: 20250319212530.80.nmisch@google.com
Lists: pgsql-hackers

On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > - Like you say, "redefine max_files_per_process to be about the number of
> > files each *backend* will additionally open". It will become normal that
> > each backend's actual FD list length is max_files_per_process + MaxBackends
> > if io_method=io_uring. Outcome is not unlike
> > v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
> > v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
> > mutate max_files_per_process. Benchmark results should not change beyond
> > the inter-major-version noise level unless one sets io_method=io_uring. I'm
> > feeling best about this one, but I've not been thinking about it long.
>
> Yea, I think that's something probably worth doing separately from Jelte's
> patch. I do think that it'd be rather helpful to have jelte's patch to
> increase NOFILE in addition though.
Agreed.
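
To make the arithmetic concrete, a minimal sketch of the per-backend FD
budget under the "redefine" option (illustrative only; estimated_backend_fds()
is a made-up name, and the real accounting lives in fd.c):

#include <stdbool.h>

static int
estimated_backend_fds(int max_files_per_process, int max_backends,
                      bool use_io_uring)
{
    /* Files the backend opens for itself, per the redefined GUC. */
    int     nfds = max_files_per_process;

    /*
     * With io_method=io_uring, each backend additionally sees roughly one
     * io_uring FD per backend, hence the "+ MaxBackends" above.
     */
    if (use_io_uring)
        nfds += max_backends;

    return nfds;
}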
> > > > > +static void
> > > > > +maybe_adjust_io_workers(void)
> > > >
> > > > This also restarts workers that exit, so perhaps name it
> > > > start_io_workers_if_missing().
> > >
> > > But it also stops IO workers if necessary?
> >
> > Good point. Maybe just add a comment like "start or stop IO workers to close
> > the gap between the running count and the configured count intent".
>
> It's now
> /*
> * Start or stop IO workers, to close the gap between the number of running
> * workers and the number of configured workers. Used to respond to change of
> * the io_workers GUC (by increasing and decreasing the number of workers), as
> * well as workers terminating in response to errors (by starting
> * "replacement" workers).
> */
Excellent.
> > > > > +{
> > > > ...
> > > > > + /* Try to launch one. */
> > > > > + child = StartChildProcess(B_IO_WORKER);
> > > > > + if (child != NULL)
> > > > > + {
> > > > > + io_worker_children[id] = child;
> > > > > + ++io_worker_count;
> > > > > + }
> > > > > + else
> > > > > + break; /* XXX try again soon? */
> > > >
> > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > > function, replacing the current mix of callers? That would be more conducive
> > > > to promptly doing the right thing after launch failure.
> > >
> > > I'm not sure that'd be a good idea - right now IO workers are started before
> > > the startup process, as the startup process might need to perform IO. If we
> > > started it only later in ServerLoop() we'd potentially do a fair bit of work,
> > > including starting checkpointer, bgwriter, bgworkers before we started IO
> > > workers. That shouldn't actively break anything, but it would likely make
> > > things slower.
> >
> > I missed that. How about keeping the two calls associated with PM_STARTUP but
> > replacing the assign_io_workers() and process_pm_child_exit() calls with one
> > in LaunchMissingBackgroundProcesses()?
>
> I think replacing the call in assign_io_workers() is a good idea, that way we
> don't need assign_io_workers().
>
> Less convinced it's a good idea to do the same for process_pm_child_exit() -
> if IO workers errored out we'll launch backends etc before we get to
> LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
> seems a bit odd.
Works for me.
> I think LaunchMissingBackgroundProcesses() should be split into one that
> starts aux processes and one that starts bgworkers. The one maintaining aux
> processes should be called before we start backends, the latter not.
That makes sense, though I've not thought about it much.
> > > > > + /*
> > > > > + * It's very unlikely, but possible, that reopen fails. E.g. due
> > > > > + * to memory allocations failing or file permissions changing or
> > > > > + * such. In that case we need to fail the IO.
> > > > > + *
> > > > > + * There's not really a good errno we can report here.
> > > > > + */
> > > > > + error_errno = ENOENT;
> > > >
> > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > > > unlikely to confuse with an actual case of libc returning that errno. Like
> > > > one of EBADF or EOWNERDEAD.
> > >
> > > Can we rely on that to be present on all platforms, including windows?
> >
> > I expect EBADF is universal. EBADF would be fine.
>
> Hm, that's actually an error that could happen for other reasons, and IMO
> would be more confusing than ENOENT. The latter describes the issue to a
> reasonable extent, EBADFD seems like it would be more confusing.
>
> I'm not sure it's worth investing time in this - it really shouldn't happen,
> and we probably have bigger problems than the error code if it does. But if we
> do want to do something, I think I can see a way to report a dedicated error
> message for this.
I agree it's not worth much investment. Let's leave that one as-is. We can
always change it further if the not-really-good errno shows up too much.
> > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
> > lists some OSs not having it, the newest of which looks like NetBSD 9.3
> > (2022). We could use it and add a #define for platforms lacking it.
>
> What would we define it as? I guess we could just pick a high value, but...
Some second-best value, but I withdraw that idea.
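
For the record, the withdrawn idea would have amounted to something like the
following (using EOWNERDEAD as the example symbol; the fallback value is
arbitrary and only needs to avoid colliding with real errnos on the affected
platform):

#include <errno.h>

/*
 * Fallback for platforms whose <errno.h> lacks the symbol (e.g. the older
 * NetBSD releases in the gnulib table above).  Arbitrary high value.
 */
#ifndef EOWNERDEAD
#define EOWNERDEAD 10062
#endif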
On Wed, Mar 12, 2025 at 07:23:47PM -0400, Andres Freund wrote:
> Attached is v2.7, with the following changes:
> Unresolved:
>
> - Whether to continue starting new workers in process_pm_child_exit()
I'm fine with that continuing. It's hurting ~nothing.
> - What to name the view (currently pg_aios). I'm inclined to go for
> pg_io_handles right now.
I like pg_aios mildly better than pg_io_handles, since "handle" sounds
implementation-centric.
On Fri, Mar 14, 2025 at 03:43:15PM -0400, Andres Freund wrote:
> Attached is v2.8 with the following changes:
> - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've
> been discussing on this thread recently. As that only affects io_uring, it
> doesn't have to block other patches going in.
As above, I like the "redefine" option.
> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> platforms that lack posix_fadvise. But with AIO we can read ahead without
> posix_fadvise().
>
> It'd not really make anything worse than today to not remove the limit, but
> it'd be pretty weird to prevent windows etc from benefiting from AIO. Need
> to look around and see whether it would require anything other than doc
> changes.
Worth changing, but non-blocking.
On Fri, Mar 14, 2025 at 03:58:43PM -0400, Andres Freund wrote:
> - Should the docs for debug_io_direct be rephrased and if so, how?
> Perhaps it's worth going from
>
> <para>
> Currently this feature reduces performance, and is intended for
> developer testing only.
> </para>
> to
> <para>
> Currently this feature reduces performance in many workloads, and is
> intended for testing only.
> </para>
>
> I.e. qualify the downside with "many workloads" and widen the audience ever so
> slightly?
Yes, that's good.
Other than the smgr patch review sent on its own thread, I've not yet reviewed
any of these patches comprehensively. Given the speed of change, I felt it
was time to flush comments buffered since 2025-03-11:
commit 0284401 wrote:
> aio: Basic subsystem initialization
> @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> */
> LWLockReleaseAll();
> pgstat_report_wait_end();
> + pgaio_error_cleanup();
AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that
proper? They do call pgaio_error_cleanup() as seen here, so the only loss is
some asserts. (The load-bearing part does get done.)
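
For concreteness, a sketch (not a proposal) of the stricter variant in those
error-recovery blocks, assuming AtEOXact_Aio() takes the same is-commit flag
as AtEOXact_Buffers():

    LWLockReleaseAll();
    pgstat_report_wait_end();
    AtEOXact_Aio(false);        /* instead of pgaio_error_cleanup(); would
                                 * also run the assertion-only checks */
    AtEOXact_Buffers(false);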
commit da72269 wrote:
> aio: Add core asynchronous I/O infrastructure
> + * This could be in aio_internal.h, as it is not pubicly referenced, but
typo -> publicly
commit 55b454d wrote:
> aio: Infrastructure for io_method=worker
> + /* Try to launch one. */
> + child = StartChildProcess(B_IO_WORKER);
> + if (child != NULL)
> + {
> + io_worker_children[id] = child;
> + ++io_worker_count;
> + }
> + else
> + break; /* XXX try again soon? */
I'd change the comment to something like one of:
retry after DetermineSleepTime()
next LaunchMissingBackgroundProcesses() will retry in <60s
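
Applied to the quoted hunk, the first variant would read:

    else
        break;                  /* retry after DetermineSleepTime() */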
On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> - Decide what to do about the smgr interrupt issue
Replied on that thread. It's essentially ready.
> Questions / Unresolved:
>
> - Write support isn't going to land in 18, but there is a tiny bit of code
> regarding writes in the code for bufmgr IO. I guess I could move that to a
> later commit?
>
> I'm inclined to leave it, the structure of the code only really makes sense
> knowing that it's going to be shared between reads & writes.
Fine to leave it.
> - pg_aios view name
Covered above.
> Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph:
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock). The process doing a read or write sets the flag for the
duration, and processes that need to wait for it to be cleared sleep on a
condition variable.
And these individual lines from "git grep BM_IO_IN_PROGRESS":
* I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
* only one process at a time can set the BM_IO_IN_PROGRESS bit.
* only one process at a time can set the BM_IO_IN_PROGRESS bit.
* i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
The last especially. For the other three lines and the paragraph, the notion
of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
being the process "doing a read" becomes less significant when one process
starts the IO and another completes it.
> + /* we better have ensured the buffer is present until now */
> + Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
I'd delete that comment; to me, the assertion alone is clearer.
> + ereport(LOG,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("invalid page in block %u of relation %s; zeroing out page",
This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's
goals; is it needed? If so, I recommend splitting it out as a preliminary
patch, to highlight the behavior change for release notes.
> +/*
> + * Perform completion handling of a single AIO read. This read may cover
> + * multiple blocks / buffers.
> + *
> + * Shared between shared and local buffers, to reduce code duplication.
> + */
> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
> + uint8 cb_data, bool is_temp)
> +{
> + PgAioResult result = prior_result;
> + PgAioTargetData *td = pgaio_io_get_target_data(ioh);
> + uint64 *io_data;
> + uint8 handle_data_len;
> +
> + if (is_temp)
> + {
> + Assert(td->smgr.is_temp);
> + Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
> + }
> + else
> + Assert(!td->smgr.is_temp);
> +
> + /*
> + * Iterate over all the buffers affected by this IO and call appropriate
> + * per-buffer completion function for each buffer.
> + */
> + io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
> + for (uint8 buf_off = 0; buf_off < handle_data_len; buf_off++)
> + {
> + Buffer buf = io_data[buf_off];
> + PgAioResult buf_result;
> + bool failed;
> +
> + Assert(BufferIsValid(buf));
> +
> + /*
> + * If the entire failed on a lower-level, each buffer needs to be
Missing word, probably fix like:
s,entire failed on a lower-level,entire I/O failed on a lower level,
> + * marked as failed. In case of a partial read, some buffers may be
> + * ok.
> + */
> + failed =
> + prior_result.status == ARS_ERROR
> + || prior_result.result <= buf_off;
I didn't run an experiment to check the following, but I think this should be
s/<=/</. Suppose we requested two blocks and read some amount of bytes
[1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0
should compute failed=false here, but buf_off==1 should compute failed=true.
I see this relies on md_readv_complete having converted "result" to blocks.
Was there some win from doing that as opposed to doing the division here?
Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
to follow, to me.
> +
> + buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> + is_temp);
> +
> + /*
> + * If there wasn't any prior error and the IO for this page failed in
> + * some form, set the whole IO's to the page's result.
s/the IO for this page/page verification/
s/IO's/IO's result/
> + */
> + if (result.status != ARS_ERROR && buf_result.status != ARS_OK)
> + {
> + result = buf_result;
> + pgaio_result_report(result, td, LOG);
> + }
> + }
> +
> + return result;
> +}