From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
Subject: | Re: AIO v2.3 |
Date: | 2025-02-10 22:54:22 |
Message-ID: | 42wltxewfbl7uj6rehcaciythpkm2k2lt7zqfpmj76mztn42yc@hauc6p2qv6lh |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2025-02-11 11:48:38 +1300, Thomas Munro wrote:
> On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
> > such. I don't love that name though.
>
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to ioring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere. So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list. Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
>
> Would the API be better like this?: When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK. Not sure what semantics nesting would have
> but I doubt it matters much.
I'm a bit unexcited about the work to redesign this, but I also admit that you
have a point :)
Linux calls a similar concept "plugging" the queue. I think I like "batch"
better, but only marginally.
> > Things that need to be fixed / are fixed in this:
> > - max pinned buffers should be limited by io_combine_limit, not * 4
> > - overflow distance
> > - pins need to be limited in more places
>
> I have patches for these and a few more things and will post in a
> separate thread shortly because they can be understood without
> reference to this AIO stuff and that'll hopefully be more digestible.
Yay!
> + /* IO finished, but result has not yet been processed */
> + PGAIO_HS_COMPLETED_IO,
> +
> + /* IO completed, shared completion has been called */
> + PGAIO_HS_COMPLETED_SHARED,
> +
> + /* IO completed, local completion has been called */
> + PGAIO_HS_COMPLETED_LOCAL,
>
> (Repeating something I mentioned in off-list bikeshedding) I wondered
> if it might be clearer to use the terminology "terminated" for the
> work that PostgreSQL has to do after an I/O completes, instead of
> overloading/subdividing the term "completed". We already "terminate"
> an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
> this feels like a sort of generalisation of that notion.
I have a mild laziness preference for complete over terminate, but not
more. If others agree with Thomas, I'm ok with changing it.
> In this AIO world, some work is done by the backend that receives the
> completion notification from the kernel, and some is done by the backend
> that submitted the I/O in the first place, a division that doesn't exist
> with simple synchronous system calls. I wonder if it would be clearer to
> use terms based on those two roles, rather than "shared" and "local",
> leading to something like:
>
> PGAIO_HS_COMPLETED,
> PGAIO_HS_TERMINATED_BY_COMPLETER,
> PGAIO_HS_TERMINATED_BY_SUBMITTER,
I don't love those, because the SHARED/LOCAL does imply more clearly what you
have access to. I.e. executing things in a shared completion callback for IO
on a temporary buffer doesn't make sense, you won't have access to the local
buffer table.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-02-10 23:10:46 | Re: AIO v2.3 |
Previous Message | Jelte Fennema-Nio | 2025-02-10 22:52:17 | Re: RFC: Allow EXPLAIN to Output Page Fault Information |