Re: AIO v2.0

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2024-09-17 18:08:19
Message-ID: 20240917180819.e1.nmisch@google.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

> > Reattaching descriptors and memory in each child may work, or one could just
> > block io_method=io_uring under EXEC_BACKEND.
>
> I think the latter option is saner

Works for me.
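
For illustration only, one way to block it would be a GUC check hook along
these lines; check_io_method() and IOMETHOD_IO_URING are my guesses at
whatever names the patch ends up using, so treat this as an untested sketch:

#include "postgres.h"
#include "utils/guc.h"

/*
 * Untested sketch: reject io_method=io_uring on EXEC_BACKEND builds, where
 * the uring FDs and mapped rings would not survive re-exec in children.
 * IOMETHOD_IO_URING is a placeholder for the patch's enum value.
 */
static bool
check_io_method(int *newval, void **extra, GucSource source)
{
#ifdef EXEC_BACKEND
	if (*newval == IOMETHOD_IO_URING)
	{
		GUC_check_errdetail("io_method \"io_uring\" is not supported on EXEC_BACKEND builds.");
		return false;
	}
#endif
	return true;
}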

> > > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > +{
>
> > > +	if (ret == -EINTR)
> > > +	{
> > > +		elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > > +		continue;
> > > +	}
> >
> > Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
> > retry on EAGAIN. "man io_uring_enter" has:
> >
> > EAGAIN The kernel was unable to allocate memory for the request, or
> > otherwise ran out of resources to handle it. The application should wait
> > for some completions and try again.
>
> Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
> in flight for each uring instance. That's different to a use of uring to
> e.g. wait for incoming network data on thousands of sockets, where you could
> have an essentially unbounded number of requests outstanding.
>
> What would we wait for? What if we were holding a critical lock in that
> moment? Would it be safe to just block for some completions? What if there's
> actually no IO in progress?

I'd try the following. First, scan all IOs of all processes at AHS_DEFINED
or later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns. I don't specifically know of a way it's unsafe. Do just one
pass of that; there may be newer IOs in progress afterward. If submit still
gets EAGAIN, sleep a bit and retry. Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations. This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.
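
In C-ish pseudocode, the submit loop I'm picturing looks roughly like the
following; pgaio_reap_foreign_completions() and PGAIO_SUBMIT_MAX_RETRIES are
names I made up for the one-pass scan and the retry cap:

#include "postgres.h"
#include <liburing.h>

/*
 * Hypothetical: one pass advancing other processes' IOs at AHS_DEFINED or
 * later toward AHS_COMPLETED_SHARED.
 */
extern void pgaio_reap_foreign_completions(void);

#define PGAIO_SUBMIT_MAX_RETRIES	10

static void
pgaio_uring_submit_with_retry(struct io_uring *ring)
{
	for (int attempt = 0;; attempt++)
	{
		int			ret = io_uring_submit(ring);

		if (ret >= 0)
			return;				/* submitted */
		if (ret == -EINTR)
			continue;			/* retry immediately, as the patch already does */
		if (ret != -EAGAIN || attempt >= PGAIO_SUBMIT_MAX_RETRIES)
			elog(PANIC, "io_uring_submit() failed: %s", strerror(-ret));

		pgaio_reap_foreign_completions();
		pg_usleep(10000L);		/* brief sleep before retrying */
	}
}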

> > > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > > +				int iovcnt, off_t offset,
> > > +				uint32 wait_event_info)
> > > +{
> > > ...
> >
> > FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> > name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
>
> Hm - that doesn't necessarily seem right to me. I don't think the caller
> should assume that the IO will just be prepared and not already completed by
> the time FileStartWriteV() returns - we might actually do the IO
> synchronously.

Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO. Still, I think this shouldn't use the term "Start" while no state
name uses that term. What else could remove that mismatch?

> > Is there any specific decision you'd like to settle before patch 6 exits
> > WIP?
>
> Patch 6 specifically? That I really mainly kept separate for review - it

No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

> doesn't seem particularly interesting to commit it earlier than 7, or do you
> think differently?

No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.

> In case you mean 6+7 or 6 to ~11, I can think of the following:
>
> - I am worried about the need for bounce buffers for writes of checksummed
> buffers. That quickly ends up being a significant chunk of memory,
> particularly when using a small shared_buffers with a higher than default
> number of connections. I'm currently hacking up a prototype that'd prevent us
> from setting hint bits with just a share lock. I'm planning to start a
> separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
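
For scale, rough arithmetic only (8kB blocks and ~100 backends assumed; these
are not figures from the patch):

#include <stdio.h>

int
main(void)
{
	const long blcksz = 8192;	/* BLCKSZ */
	const long backends = 100;	/* ballpark MaxBackends */

	/* 256 blocks/backend -> 2MB/backend -> ~200MB total */
	printf("cap 256: %ld MB\n", 256 * blcksz * backends / (1024 * 1024));
	/* 128 blocks/backend -> 1MB/backend -> ~100MB total */
	printf("cap 128: %ld MB\n", 128 * blcksz * backends / (1024 * 1024));
	return 0;
}

At the current cap that already exceeds the default 128MB shared_buffers,
which is the small-shared_buffers, many-connections configuration the worry
is about; halving the cap buys real headroom.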

> - The header split doesn't quite seem right yet

I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.

> - I'd like to implement retries in the later patches, to make sure that it
> doesn't have design implications

Yes, that's a blocker to me.

> - Worker mode needs to be able to automatically adjust the number of running
> workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)

> - I think the PgAioHandles need to be slimmed down a bit - there's some design
> evolution visible that should not end up in the tree.

Okay.

> - I'm not sure that I like the name "subject" for the different things AIO is
> performed for

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO, in the object-oriented sense]

> - I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
> of buffer ids that are affected by one IO) could be replaced by repurposing
> BufferDesc->freeNext or something along those lines. I don't like the amount
> of memory required for storing those arrays, even if it's not that much
> compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
> handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

> - I'd like to extend the test module to actually test more cases; it's too
> hard to reach some paths, particularly without [a lot] of users yet. That's
> not strictly a dependency of the earlier patches - since the initial patches
> can't actually do much in the way of IO.

Agreed. Looking at check-world coverage once these patches are in, which
still-uncovered parts carry the most risk?

> - We shouldn't reserve AioHandles etc for io workers - but because different
> types of aux processes don't use a predetermined ProcNumber, that's not
> entirely trivial without adding more complexity. I've actually wondered
> whether IO workers should be their own "top-level" kind of process, rather
> than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker. Of the items I'm calling non-blockers, which would you most
regret deferring?

> - Right now the io_uring mode has each backend's io_uring instance visible to
> each other process. That ends up using a fair number of FDs. That's OK from
> an efficiency perspective, but I think we'd need to add code to adjust the
> soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
> various programs that iterate over all possible FDs, causing significant

Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.

> slowdowns when the soft limit defaults to something high). I earlier had a
> limited number of io_uring instances, but that added a fair amount of
> overhead because then submitting IO would require a lock.
>
> That again doesn't have to be solved as part of the earlier patches but
> might have some minor design impact.

How far do you see the design impact spreading on that one?
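
(Back on the RLIMIT_NOFILE point, the adjustment I'm picturing is roughly the
untested sketch below; the target value would be derived from the expected
number of uring FDs, which I haven't tried to compute here:

#include <stdbool.h>
#include <sys/resource.h>

/* Raise only the soft RLIMIT_NOFILE, capped at the hard limit. */
static bool
pgaio_raise_nofile_limit(rlim_t target)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return false;

	if (rl.rlim_cur >= target)
		return true;			/* already high enough */

	rl.rlim_cur = (target < rl.rlim_max) ? target : rl.rlim_max;
	return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}

Since rlimits are per-process, raising our own soft limit within the hard
limit leaves the distro default alone for the FD-iterating programs you
mention, which is why only rlim_cur is touched.)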

Thanks,
nm
