Re: AIO v2.0

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2024-09-16 17:51:42
Message-ID: 237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > There's plenty more to do, but I thought this would be a useful checkpoint.
>
> I find patches 1-5 are Ready for Committer.

Cool!

> > +typedef enum PgAioHandleState
>
> This enum clarified a lot for me, so I wish I had read it before anything
> else. I recommend referring to it in README.md.

Makes sense.

> Would you also cover the valid state transitions and which of them any
> backend can do vs. which are specific to the defining backend?

Yea, we should. I earlier had something, but because details were still
changing it was hard to keep up2date.

> > +{
> > + /* not in use */
> > + AHS_IDLE = 0,
> > +
> > + /* returned by pgaio_io_get() */
> > + AHS_HANDED_OUT,
> > +
> > + /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> > + AHS_DEFINED,
> > +
> > + /* subjects prepare() callback has been called */
> > + AHS_PREPARED,
> > +
> > + /* IO is being executed */
> > + AHS_IN_FLIGHT,
>
> Let's align terms between functions and states those functions reach. For
> example, I recommend calling this state AHS_SUBMITTED, because
> pgaio_io_prepare_submit() is the function reaching this state.
> (Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.

> > +void
> > +pgaio_io_ref_wait(PgAioHandleRef *ior)
> > +{
> > + uint64 ref_generation;
> > + PgAioHandleState state;
> > + bool am_owner;
> > + PgAioHandle *ioh;
> > +
> > + ioh = pgaio_io_from_ref(ior, &ref_generation);
> > +
> > + am_owner = ioh->owner_procno == MyProcNumber;
> > +
> > +
> > + if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> > + return;
> > +
> > + if (am_owner)
> > + {
> > + if (state == AHS_DEFINED || state == AHS_PREPARED)
> > + {
> > + /* XXX: Arguably this should be prevented by callers? */
> > + pgaio_submit_staged();
>
> Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
> longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.

That, or not having submitted the IO. One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.

> > +void
> > +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> > +{
> > + Assert(ioh->state == AHS_HANDED_OUT);
> > + Assert(pgaio_io_has_subject(ioh));
> > +
> > + ioh->op = op;
> > + ioh->state = AHS_DEFINED;
> > + ioh->result = 0;
> > +
> > + /* allow a new IO to be staged */
> > + my_aio->handed_out_io = NULL;
> > +
> > + pgaio_io_prepare_subject(ioh);
> > +
> > + ioh->state = AHS_PREPARED;
>
> As defense in depth, let's add a critical section from before assigning
> AHS_DEFINED to here. This code already needs to be safe for that (per
> README.md). When running outside a critical section, an ERROR in a subject
> callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
> doubt there's a plausible way to reach that leak today, but future subject
> callbacks could add risk over time.

Makes sense.

> > +if test "$with_liburing" = yes; then
> > + PKG_CHECK_MODULES(LIBURING, liburing)
> > +fi
>
> I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.

> With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
> fails early:

Right - that's to be expected.

> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
> 2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
> 2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
> 2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
> 2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
> 2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
> 2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
> 2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request
>
> I expect that's from io_uring_queue_init() stashing in shared memory a file
> descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
> Reattaching descriptors and memory in each child may work, or one could just
> block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful fo portability testing.

> > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > +{

> > + if (ret == -EINTR)
> > + {
> > + elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > + continue;
> > + }
>
> Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
> retry on EAGAIN. "man io_uring_enter" has:
>
> EAGAIN The kernel was unable to allocate memory for the request, or
> otherwise ran out of resources to handle it. The application should wait
> for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

> > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > + int iovcnt, off_t offset,
> > + uint32 wait_event_info)
> > +{
> > ...
>
> FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

> For non-sync IO methods, I gather it's essential that a process other than the
> IO definer be scanning for incomplete IOs and completing them.

Yep - it's something I've been fighting with / redesigning a *lot*. Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:

1) What if the IO is being completed in a critical section? We can't reopen
the file in that situation. My initial fix for this was to defer retries,
but that's problematic too:

2) Acquiring an IO needs to be able to guarantee forward progress. Because
there's a limited number of IOs that means we need to be able to complete
IOs while acquiring an IO. So we can't just keep the IO handle around -
which in turn means that we'd need to save the state for retrying
somewhere. Which would require some pre-allocated memory to save that
state.

Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...

> Otherwise, deadlocks like this would happen:

> backend1 locks blk1 for non-IO reasons
> backend2 locks blk2, starts AIO write
> backend1 waits for lock on blk2 for non-IO reasons
> backend2 waits for lock on blk1 for non-IO reasons
>
> If that's right, in worker mode, the IO worker resolves that deadlock. What
> resolves it under io_uring? Another process that happens to do
> pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> make that happen systematically.

Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.

> I could share more-tactical observations about patches 6-20, but they're
> probably things you'd change without those observations.

Agreed.

> Is there any specific decision you'd like to settle before patch 6 exits
> WIP?

Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connection. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

- The header split doesn't yet quite seem right yet

- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications

- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.

- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.

- I'm not sure that I like name "subject" for the different things AIO is
performed for

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.

- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.

- We shouldn't reserve AioHandles etc for io workers - but because different
tpes of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workes should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.

- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.

That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.

Thanks again,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-09-16 17:56:09 Re: AIO v2.0
Previous Message Alvaro Herrera 2024-09-16 17:47:09 Re: not null constraints, again