From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Subject: Re: AIO v2.5
Date: 2025-03-12 03:57:43
Message-ID: 20250312035743.f5.nmisch@google.com
Lists: pgsql-hackers
On Tue, Mar 11, 2025 at 07:55:35PM -0400, Andres Freund wrote:
> On 2025-03-11 12:41:08 -0700, Noah Misch wrote:
> > On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> > > On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> What do we want to do for ConditionalLockBufferForCleanup() (I don't think
> IsBufferCleanupOK() can matter)? I suspect we should also make it wait for
> the IO. See below:
I agree IsBufferCleanupOK() can't matter. It asserts that the caller already
holds the exclusive buffer lock, and it doesn't loop or wait.
> [...] leads me to think that causing
> the IO to complete is probably the safest bet. Triggering IO completion never
> requires acquiring new locks that could participate in a deadlock, so it'd be
> safe.
Yes, that decision looks right to me. I scanned the callers, and none of them
clearly prefers a different choice. If we someday find one caller prefers a
false return over blocking on I/O completion, we can always introduce a new
ConditionalLock* variant for that.
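To make that concrete, here is the shape I'd expect, as a sketch only: the
pgaio_wref_valid()/pgaio_wref_wait() names and the per-buffer io_wref field
are my assumptions from the patchset, and buffer-header locking is elided:

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
	BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

	/*
	 * Drive any in-flight IO on this buffer to completion first.
	 * Completing IO never acquires new locks, so it cannot deadlock, and
	 * afterwards a false return can only mean genuine pin/lock contention
	 * rather than our own unfinished read.
	 */
	if (pgaio_wref_valid(&bufHdr->io_wref))
		pgaio_wref_wait(&bufHdr->io_wref);

	/* ... existing conditional pin/refcount logic, unchanged ... */
}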
> > > - To allow io_workers to be PGC_SIGHUP, and to eventually allow to
> > > automatically in/decrease active workers, the max number of workers (32) is
> > > always allocated. That means we use more semaphores than before. I think
> > > that's ok, it's not 1995 anymore. Alternatively we can add a
> > > "io_workers_max" GUC and probe for it in initdb.
> >
> > Let's start as you have it. If someone wants to make things perfect for
> > non-root BSD users, they can add the GUC later. io_method=sync is a
> > sufficient backup plan indefinitely.
>
> Cool.
>
> I think we'll really need to do something about this for BSD users regardless
> of AIO. Or maybe those OSs should fix something, but somehow I am not having
> high hopes for an OS that claims to have POSIX-conforming unnamed semaphores
> due to having a syscall that always returns EPERM... [1].
I won't mind a project making things better for non-root BSD users. I do
think such a project should not block other projects making things better for
everything else (like $SUBJECT).
> > > - Check if documentation for track_io_timing needs to be adjusted, after the
> > > bufmgr.c changes we only track waiting for an IO.
> >
> > Yes.
>
> The relevant sentences seem to be:
>
> - "Enables timing of database I/O calls."
>
> s/calls/waits/
>
> - "Time spent in {read,write,writeback,extend,fsync} operations"
>
> s/in/waiting for/
>
> Even though not all of these will use AIO, the "waiting for" formulation
> seems just as accurate.
>
> - "Columns tracking I/O time will only be non-zero when <xref
> linkend="guc-track-io-timing"/> is enabled."
>
> s/time/wait time/
Sounds good.
> > On Mon, Mar 10, 2025 at 02:23:12PM -0400, Andres Freund wrote:
> > > Attached is v2.6 of the AIO patchset.
> >
> > > - 0005, 0006 - io_uring support - close, but we need to do something about
> > > set_max_fds(), which errors out spuriously in some cases
> >
> > What do we know about those cases? I don't see a set_max_fds(); is that
> > set_max_safe_fds(), or something else?
>
> Sorry, yes, set_max_safe_fds(). The problem basically is that with io_uring we
> will have a large number of FDs already allocated by the time
> set_max_safe_fds() is called. set_max_safe_fds() subtracts already_open from
> max_files_per_process allowing few, and even negative, IOs.
>
> I think we should redefine max_files_per_process to be about the number of
> files each *backend* will additionally open. Jelte was working on related
> patches, see [2]
Got it. max_files_per_process is a quaint setting, documented as follows (I
needed the reminder):
If the kernel is enforcing
a safe per-process limit, you don't need to worry about this setting.
But on some platforms (notably, most BSD systems), the kernel will
allow individual processes to open many more files than the system
can actually support if many processes all try to open
that many files. If you find yourself seeing <quote>Too many open
files</quote> failures, try reducing this setting.
I could live with
v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but would lean
against it since it feels unduly novel to have a setting where we use the
postgresql.conf value to calculate a value that becomes the new SHOW-value of
the same setting. Options I'd consider before that:
- Like you say, "redefine max_files_per_process to be about the number of
files each *backend* will additionally open". It will become normal that
each backend's actual FD list length is max_files_per_process + MaxBackends
if io_method=io_uring. Outcome is not unlike
v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
mutate max_files_per_process. Benchmark results should not change beyond
the inter-major-version noise level unless one sets io_method=io_uring. I'm
feeling best about this one, but I've not been thinking about it long. (A
sketch follows this list.)
- When building with io_uring, make the max_files_per_process default
something like 10000 instead. Disadvantages: FD usage grows even if you
don't use io_uring. Merely rebuilding with io_uring (not enabling it at
runtime) will change benchmark results. High MaxBackends still needs to
override the value.
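For the first option, here's roughly how I picture fd.c's set_max_safe_fds()
changing; a sketch under my assumptions, not Jelte's patches:

void
set_max_safe_fds(void)
{
	int			usable_fds;
	int			already_open;

	/* probe how many more FDs this process can open right now */
	count_usable_fds(max_files_per_process,
					 &usable_fds, &already_open);

	/*
	 * Redefinition: max_files_per_process counts the FDs a backend opens
	 * *beyond* whatever is already open (e.g. io_uring instances), so we
	 * no longer subtract already_open from the budget.
	 */
	max_safe_fds = Min(usable_fds, max_files_per_process);

	if (max_safe_fds < FD_MINFREE)
		ereport(FATAL,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("insufficient file descriptors available to start server process")));
}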
> > > +static void
> > > +maybe_adjust_io_workers(void)
> >
> > This also restarts workers that exit, so perhaps name it
> > start_io_workers_if_missing().
>
> But it also stops IO workers if necessary?
Good point. Maybe just add a comment like "start or stop IO workers to close
the gap between the running count and the configured count".
> > > +{
> > ...
> > > + /* Try to launch one. */
> > > + child = StartChildProcess(B_IO_WORKER);
> > > + if (child != NULL)
> > > + {
> > > + io_worker_children[id] = child;
> > > + ++io_worker_count;
> > > + }
> > > + else
> > > + break; /* XXX try again soon? */
> >
> > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > function, replacing the current mix of callers? That would be more conducive
> > to promptly doing the right thing after launch failure.
>
> I'm not sure that'd be a good idea - right now IO workers are started before
> the startup process, as the startup process might need to perform IO. If we
> started it only later in ServerLoop() we'd potentially do a fair bit of work,
> including starting checkpointer, bgwriter, bgworkers before we started IO
> workers. That shouldn't actively break anything, but it would likely make
> things slower.
I missed that. How about keeping the two calls associated with PM_STARTUP but
replacing the assign_io_workers() and process_pm_child_exit() calls with one
in LaunchMissingBackgroundProcesses()? In the event of a launch failure, I
think that would retry the launch quickly, as opposed to maybe-never.
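Concretely, something like this in postmaster.c; a sketch, with the
io_method/IOMETHOD_WORKER names assumed from the patchset:

static void
LaunchMissingBackgroundProcesses(void)
{
	/* ... existing checkpointer/bgwriter/etc. launches ... */

	/*
	 * Replaces the assign_io_workers() and process_pm_child_exit() call
	 * sites: since this runs on every ServerLoop() iteration, a failed
	 * fork gets retried within seconds rather than maybe-never.
	 */
	if (io_method == IOMETHOD_WORKER)
		maybe_adjust_io_workers();
}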
> I rather dislike the code around when we start what. Leaving AIO aside, during
> a normal startup we start checkpointer, bgwriter before the startup
> process. But during a crash restart we don't explicitly start them. Why make
> things uniform when it could also be exciting :)
It's become some artisanal code! :)
> > > + /*
> > > + * It's very unlikely, but possible, that reopen fails. E.g. due
> > > + * to memory allocations failing or file permissions changing or
> > > + * such. In that case we need to fail the IO.
> > > + *
> > > + * There's not really a good errno we can report here.
> > > + */
> > > + error_errno = ENOENT;
> >
> > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > unlikely to confuse with an actual case of libc returning that errno. Like
> > one of EBADF or EOWNERDEAD.
>
> Can we rely on that to be present on all platforms, including windows?
I expect EBADF is universal. EBADF would be fine.
EOWNERDEAD is from 2006.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=msvc-140
says VS2015 had EOWNERDEAD (the page doesn't have links for older Visual
Studio versions, so I consider them unknown).
https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
lists some OSs not having it, the newest of which looks like NetBSD 9.3
(2022). We could use it and add a #define for platforms lacking it.
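The fallback would be tiny, e.g. in a port header (130 is Linux's value; on a
platform lacking the constant the exact number is arbitrary, it just must not
collide with an errno libc actually returns):

#include <errno.h>

#ifndef EOWNERDEAD
#define EOWNERDEAD 130
#endif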
> > > + ret = io_uring_submit(uring_instance);
> > > + pgstat_report_wait_end();
> > > +
> > > + if (ret == -EINTR)
> > > + {
> > > + pgaio_debug(DEBUG3,
> > > + "aio method uring: submit EINTR, nios: %d",
> > > + num_staged_ios);
> > > + }
> > > + else if (ret < 0)
> > > + elog(PANIC, "failed: %d/%s",
> > > + ret, strerror(-ret));
> >
> > I still think (see 2024-09-16 review) EAGAIN should do the documented
> > recommendation instead of PANIC:
> >
> > EAGAIN The kernel was unable to allocate memory for the request, or
> > otherwise ran out of resources to handle it. The application should wait for
> > some completions and try again.
>
> I don't think this can be hit in a recoverable way. We'd likely just end up
> with an untested path that quite possibly would be wrong.
>
> What wait time would be appropriate? What problems would it cause if we just
> slept while holding critical lwlocks? I think it'd typically just delay the
> crash-restart if we did, making it harder to recover from the problem.
I might use 30s like pgwin32_open_handle(), but 30s wouldn't be principled.
> Because we are careful to limit how many outstanding IO requests there are on
> an io_uring instance, the kernel has to have run *severely* out of memory to
> hit this.
>
> I suspect it might currently be *impossible* to hit this due to ENOMEM,
> because io_uring will fall back to allocating individual requests if the
> batch allocation it normally does fails. My understanding is that for small
> allocations the kernel will try to reclaim memory forever; only large ones
> can fail.
>
> Even if it were possible to hit, the likelihood that postgres can continue to
> work ok if the kernel can't allocate ~250 bytes seems very low.
>
> How about adding a dedicated error message for EAGAIN? IMO io_uring_enter()'s
> meaning of EAGAIN is, uhm, unconventional, so a better error message than
> strerror() might be good?
I'm fine with the present error message.
> Proposed comment:
> /*
> * The io_uring_enter() manpage suggests that the appropriate
> * reaction to EAGAIN is:
> *
> * "The application should wait for some completions and try
> * again"
> *
> * However, it seems unlikely that that would help in our case, as
> * we apply a low limit to the number of outstanding IOs and thus
> * also outstanding completions, making it unlikely that we'd get
> * EAGAIN while the OS is in good working order.
> *
> * Additionally, it would be problematic to just wait here, as our
> * caller might hold critical locks. It'd possibly lead to
> * delaying the crash-restart that seems likely to occur when the
> * kernel is under such heavy memory pressure.
> */
That works for me. No retry needed, then.
> > > + pgstat_report_wait_end();
> > > +
> > > + if (ret == -EINTR)
> > > + {
> > > + continue;
> > > + }
> > > + else if (ret != 0)
> > > + {
> > > + elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));
> >
> > I think errno isn't meaningful here, so %m doesn't belong.
>
> You're right. I wonder if we should make errno meaningful though (by setting
> it); the elog.c machinery captures it, and I know there are logging hooks
> that utilize that fact. That'd also avoid the need to use strerror() here.
That's better still.
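Something like this, then (a sketch; the message text is mine):

	if (ret == -EINTR)
		continue;
	else if (ret != 0)
	{
		errno = -ret;			/* elog.c captures errno for logging hooks,
								 * and %m replaces the strerror() call */
		elog(PANIC, "io_uring wait failed: %m");
	}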
> > > --- a/doc/src/sgml/config.sgml
> > > +++ b/doc/src/sgml/config.sgml
> > > @@ -2687,6 +2687,12 @@ include_dir 'conf.d'
> > > <literal>worker</literal> (execute asynchronous I/O using worker processes)
> > > </para>
> > > </listitem>
> > > + <listitem>
> > > + <para>
> > > + <literal>io_uring</literal> (execute asynchronous I/O using
> > > + io_uring, if available)
> > > + </para>
> > > + </listitem>
> >
> > Docs should eventually cover RLIMIT_MEMLOCK per
> > https://github.com/axboe/liburing "ulimit settings".
>
> Given the way we currently use io_uring (i.e. no registered buffers), the
> RLIMIT_MEMLOCK advice only applies to Linux <= 5.11. I'm not sure that's
> worth documenting?
Kernel 5.11 will be about 4.5 years old by the time v18 is out. Yeah, no need for
doc coverage of that.
> > One later-patch item:
> >
> > > +static PgAioResult
> > > +SharedBufferCompleteRead(int buf_off, Buffer buffer, uint8 flags, bool failed)
> > > +{
> > ...
> > > + TRACE_POSTGRESQL_BUFFER_READ_DONE(tag.forkNum,
> > > + tag.blockNum,
> > > + tag.spcOid,
> > > + tag.dbOid,
> > > + tag.relNumber,
> > > + INVALID_PROC_NUMBER,
> > > + false);
> >
> > I wondered about whether the buffer-read-done probe should happen in the
> > process that calls the complete_shared callback or in the process that did the
> > buffer-read-start probe.
>
> Yea, that's a good point. I should at least have added a comment pointing out
> that it's a choice with pros and cons.
>
> The reason I went for doing it in the completion callback is that it seemed
> better to get the READ_DONE event as soon as possible, even if the issuer of
> the IO is currently busy doing other things. The shared completion callback is
> after all where the buffer state is updated for shared buffers.
>
> But I think you have a point too.
>
>
> > When I see dtrace examples, they usually involve explicitly naming each PID
> > to trace
>
> TBH, I've only ever used our tracepoints via perf and bpftrace, not dtrace
> itself. For those it's easy to trace more than just a single pid and to
> monitor system-wide. I don't really know enough about using dtrace itself.
Perhaps just a comment, then.
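E.g., next to the probe call in the completion callback, something like:

/*
 * Unlike the other buffer-read probes, buffer-read-done fires in whichever
 * backend runs the shared completion callback, which need not be the
 * backend that fired buffer-read-start.  Traces filtered to a single PID
 * may therefore see unpaired start/done events; trace system-wide to pair
 * them.
 */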
> > Assuming that's indeed the norm, I think the local callback would
> > be the better place, so a given trace contains both probes.
>
> Seems like a shame to add an extra indirect function call
Yep.
> This was an awesome review, thanks!
Glad it helped.
> [1] https://man.openbsd.org/sem_init.3#STANDARDS
> [2] https://postgr.es/m/D80MHNSG4EET.6MSV5G9P130F%40jeltef.nl