Re: io_uring support

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: io_uring support
Date: 2019-08-23 10:14:05
Message-ID: CA+q6zcUMYkCrmq9m32iFu-cTYbhpB0XG8Gu_T_TviddvMT6ZMA@mail.gmail.com
Lists: pgsql-hackers

> On Mon, Aug 19, 2019 at 10:21 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > For us the important part is probably that it's asynchronous IO that can
> > work not only with O_DIRECT, but also with buffered access.
>
> Note that while the buffered access does allow for some acceleration, it
> currently does have quite noticeable CPU overhead.

I haven't looked deeply at benchmarks yet. Are there any public results that
show this? So far I've seen only [1], but it doesn't say much about CPU
overhead. It could also be interesting to check io_uring-bench.

> > I've tested this patch so far only inside a qemu vm on the latest
> > io_uring branch from linux-block tree. The result is relatively
> > simple, and introduces new interface smgrqueueread, smgrsubmitread and
> > smgrwaitread to queue any read we want, then submit a queue to a
> > kernel and then wait for a result. The simplest example of how this
> > interface could be used I found in pg_prewarm for buffers prefetching.
>
> Hm. I'm bit doubtful that that's going in the direction of being the
> right interface. I think we'd basically have to insist that all AIO
> capable smgr's use one common AIO layer (note that the UNDO patches add
> another smgr implementation). Otherwise I think we'll have a very hard
> time to make them cooperate. An interface like this would also lead to
> a lot of duplicated interfaces, because we'd basically need most of the
> smgr interface functions duplicated.
>
> I suspect we'd rather have to build something where the existing
> functions grow a parameter controlling synchronicity. If AIO is allowed
> and supported, the smgr implementation would initiate the IO, together
> with a completion function for it, and return some value allowing the
> caller to wait for the result if desirable.

Agreed, all AIO-capable smgrs need to use some common layer. But it seems hard
to implement some async operations only by adding more parameters, e.g.
accumulating AIO operations before submitting them to the kernel.
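
To illustrate the concern, here is a rough sketch of what "one function, plus
a mode parameter" could look like once batching is involved. It is a toy model,
not code from the patch: the IO itself is faked with a memcpy, and every name
(IoMode, IoHandle, io_read, io_submit_pending, io_wait) is hypothetical. Note
that the mode parameter alone is not enough here; a separate submit step still
shows up.

```c
#include <stddef.h>
#include <string.h>

/* All names are hypothetical; none of them exist in PostgreSQL. */
typedef enum { IO_SYNC, IO_ASYNC } IoMode;

typedef void (*IoCompletion) (void *arg, int result);

typedef struct IoHandle
{
    char       *dst;        /* where the data should land */
    const char *src;        /* stand-in for the on-disk block */
    size_t      len;
    IoCompletion on_done;   /* called once the IO has finished */
    void       *arg;
    int         done;
} IoHandle;

#define MAX_PENDING 16
static IoHandle *pending[MAX_PENDING];
static int  npending = 0;

/* Pretend to perform the IO: in this toy model it is just a memcpy. */
static void
io_complete(IoHandle *h)
{
    memcpy(h->dst, h->src, h->len);
    h->done = 1;
    if (h->on_done)
        h->on_done(h->arg, (int) h->len);
}

/* Submit everything accumulated so far; returns the number submitted. */
static int
io_submit_pending(void)
{
    int         n = npending;

    for (int i = 0; i < n; i++)
        io_complete(pending[i]);
    npending = 0;
    return n;
}

/*
 * One read function for both modes. IO_SYNC finishes before returning;
 * IO_ASYNC only queues the request, and the caller keeps the handle.
 */
static int
io_read(IoHandle *h, IoMode mode)
{
    h->done = 0;
    if (mode == IO_SYNC)
    {
        io_complete(h);
        return 0;
    }
    if (npending == MAX_PENDING)
        return -1;          /* batch is full, caller has to submit first */
    pending[npending++] = h;
    return 0;
}

/* Wait for one handle; submits the pending batch if that is still needed. */
static void
io_wait(IoHandle *h)
{
    if (!h->done)
        io_submit_pending();
}
```

A caller would queue several reads with IO_ASYNC and then call io_wait() on
the first handle it needs, which flushes the whole accumulated batch.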

> > As a result of this experiment I have few questions, open points and requests
> > for the community experience:
> >
> > * I guess the proper implementation to use async IO is a big deal, but could
> > bring also significant performance advantages. Is there any (nearest) future
> > for such kind of async IO in PostgreSQL? Buffer prefetching is a simplest
> > example, but taking into account that io_uring supports ordering, barriers
> > and linked events, there are probably more use cases when it could be useful.
>
> The lowest hanging fruit that I can see - and which I played with - is
> making the writeback flushing use async IO. That's particularly
> interesting for bgwriter. As it commonly only performs random IO, and
> as we need to keep the number of dirty buffers in the kernel small to
> avoid huge latency spikes, being able to submit IOs asynchronously can
> yield significant benefits.

Yeah, sounds interesting. Are there any results you can already share? Maybe
it's possible to collaborate on this topic?

> > * Assuming that the answer for previous question is positive, there could be
> > different strategies how to use io_uring. So far I see different
> > opportunities for waiting. Let's say we have prepared a batch of async IO
> > operations and submitted it. Then we can e.g.
> >
> > -> just wait for a batch to be finished
> > -> wait (in the same syscall as submitting) for previously submitted batches,
> > then start submitting again, and at the end wait for the leftovers
> > -> peek if there are any events completed, and get only those without waiting
> > for the whole batch (in this case it's necessary to make sure submission
> > queue is not overflowed)
> >
> > So it's open what and when to use.
>
> I don't think there's much point in working only with complete
> batches. I think we'd lose too much of the benefit by introducing
> unnecessary synchronous operations. I think we'd need to design the
> interface in a way that there constantly can be in-progress IOs, block
> when the queue is full, and handle finished IOs using a callback
> mechanism or such.

What would happen if, at some particular moment, we suddenly don't have enough
IO to fill the queue? Probably there should be more triggers for blocking.
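
A minimal sketch of what I mean, as a toy model without any real io_uring
calls. "Waiting for a completion" is simulated by finishing the oldest
in-flight IO, and all names (AioQueue, aio_start, aio_drain, QUEUE_DEPTH) are
made up for the example. aio_start() blocks only when the queue is full, while
aio_drain() is a second trigger for the case where the queue never fills up.

```c
#include <stddef.h>

#define QUEUE_DEPTH 4           /* hypothetical limit on in-flight IOs */

typedef void (*AioCallback) (int io_id, void *arg);

typedef struct AioSlot
{
    int         io_id;
    AioCallback cb;
    void       *arg;
} AioSlot;

typedef struct AioQueue
{
    AioSlot     slots[QUEUE_DEPTH];
    int         head;       /* index of the oldest in-flight IO */
    int         count;      /* number of in-flight IOs */
    int         blocked;    /* how often a start had to wait */
} AioQueue;

/* Stand-in for waiting on one completion: finish the oldest IO. */
static void
aio_reap_one(AioQueue *q)
{
    AioSlot    *s = &q->slots[q->head];

    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    if (s->cb)
        s->cb(s->io_id, s->arg);
}

/* Start an IO; wait for a completion only if the queue is full. */
static void
aio_start(AioQueue *q, int io_id, AioCallback cb, void *arg)
{
    AioSlot    *s;

    if (q->count == QUEUE_DEPTH)
    {
        q->blocked++;
        aio_reap_one(q);
    }
    s = &q->slots[(q->head + q->count) % QUEUE_DEPTH];
    s->io_id = io_id;
    s->cb = cb;
    s->arg = arg;
    q->count++;
}

/* Second trigger: nothing more to queue, but the results are needed now. */
static void
aio_drain(AioQueue *q)
{
    while (q->count > 0)
        aio_reap_one(q);
}

/* Example callback: count finished IOs. */
static void
count_done(int io_id, void *arg)
{
    (void) io_id;
    (*(int *) arg)++;
}
```

With six IOs and a depth of four, the fifth and sixth aio_start() calls each
have to wait for one completion, and the remaining four IOs are only finished
by the explicit aio_drain().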

> > * How may look like a data structure, that can describe IO from PostgreSQL
> > perspective? With io_uring we need to somehow identify IO operations that
> > were completed. For now I'm just using a buffer number.
>
> In my hacks I've used the sqe's user_data to point to a struct with
> information about the IO.

Yes, that's the same approach I'm using too. I'm just not sure what exactly
this "struct with information about the IO" should be, and what it should
ideally contain.
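
As a starting point for discussion, something like the following could work.
This is only a guess at the contents: the struct, its fields and the helper
names are hypothetical. The one fixed point is that user_data is a 64-bit
integer in both the sqe and the cqe, so the pointer has to be packed into it
and unpacked on completion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; a guess at what an IO descriptor might carry. */
typedef enum { PG_IO_READ, PG_IO_WRITE, PG_IO_FSYNC } PgIoOp;

typedef struct PgIoDesc
{
    PgIoOp      op;
    int         buffer_id;  /* what the experimental patch uses today */
    uint32_t    block_no;
    int         result;     /* filled in from the completion's result */
    void        (*on_complete) (struct PgIoDesc *io);
} PgIoDesc;

/* Pack the descriptor pointer into a 64-bit user_data value. */
static uint64_t
io_to_user_data(PgIoDesc *io)
{
    return (uint64_t) (uintptr_t) io;
}

/* Recover the descriptor from the user_data value of a completion. */
static PgIoDesc *
io_from_user_data(uint64_t user_data)
{
    return (PgIoDesc *) (uintptr_t) user_data;
}

/* What a completion handler would do with the values read from a cqe. */
static void
handle_completion(uint64_t user_data, int res)
{
    PgIoDesc   *io = io_from_user_data(user_data);

    io->result = res;
    if (io->on_complete)
        io->on_complete(io);
}
```

The descriptor has to stay allocated until its completion has been handled,
which is one more thing such a struct (or its owner) would need to track.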

> > experimental patch has many limitations, e.g. only one ring is used for
> > everything, which is of course far from ideal and makes identification even
> > more important.
>
> I think we don't want to use more than one ring. Makes it too
> complicated to have interdependencies between operations (e.g. waiting
> for fsyncs before submitting further writes). I also don't really see
> why we would benefit from more?

Since the balance between SQEs and CQEs can be important, and there could be
different "sources of AIO" with different submission frequencies, I thought it
could be handy to separate heavily loaded rings from general-purpose rings
(especially in the case of ordered AIO).
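
For reference, the sizing arithmetic per ring, as far as I understand the
default behaviour: the requested number of entries is rounded up to a power of
two for the submission queue, and the completion queue gets twice that. The
helper names below are made up, and this only models the arithmetic; it does
not call into the kernel.

```c
#include <stdint.h>

/* Hypothetical helper type: the sizes one ring would end up with. */
typedef struct RingSizes
{
    uint32_t    sq_entries;
    uint32_t    cq_entries;
} RingSizes;

/* Round up to the next power of two; assumes v is small (far below 2^31). */
static uint32_t
round_up_pow2(uint32_t v)
{
    uint32_t    p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Default sizing: n submission entries, 2*n completion entries. */
static RingSizes
default_ring_sizes(uint32_t requested)
{
    RingSizes   s;

    s.sq_entries = round_up_pow2(requested);
    s.cq_entries = 2 * s.sq_entries;
    return s;
}
```

So every additional ring brings its own SQ/CQ pair with this ratio, which is
part of what would need benchmarking.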

> > * There are few more freedom dimensions, that io_uring introduces - how many
> > rings to use, how many events per ring (which is going to be n for sqe and
> > 2*n for cqe), how many IO operations per event to do (similar to
> > preadv/pwritev we can provide a vector), what would be the balance between
> > submit and complete queues. I guess it will require a lot of benchmarking to
> > find a good values for these.
>
>
> One thing you didn't mention: A lot of this also requires that we
> overhaul the way buffer locking for IOs works. Currently we really can
> only have one proper IO in progress at a time, which clearly isn't
> sufficient for anything that wants to use AIO.

Yeah, that's correct. My hope is that this could be done in small steps, e.g.
by introducing AIO only for some particular cases to see how it would work.

[1]: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/
