Re: AIO v2.3

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Subject: Re: AIO v2.3
Date: 2025-02-10 22:48:38
Message-ID: CA+hUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)

> - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
> such. I don't love that name though.

Problem statement: You want to be able to batch I/O submission, ie
make a single call to ioring_enter() (and other mechanisms) to start
several I/Os, but the code that submits is inside StartReadBuffers()
and the code that knows how many I/Os it wants to start now is at a
higher level, read_stream.c and in future elsewhere. So you invented
this flag to tell StartReadBuffers() not to call
pgaio_submit_staged(), because you promise to do it later, via this
staging list. Additionally, there is a kind of programming rule here
that you *must* submit I/Os that you stage, you aren't allowed to (for
example) stage I/Os and then sleep, so it has to be a fairly tight
piece of code.

Would the API be better like this?: When you want to create a batch
of I/Os submitted together, you wrap the work in pgaio_begin_batch()
and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
Then bufmgr wouldn't need this flag: when it (or anything else) calls
smgrstartreadv(), if there is not currently an explicit batch then it
would be submitted immediately, and otherwise it would only be staged.
This way, batch construction (or whatever word you prefer for batch)
is in a clearly and explicitly demarcated stretch of code in one
lexical scope (though its effect is dynamically scoped just like the
staging list itself because we don't want to pass explicit I/O
contexts through the layers), but code that doesn't call those and
reaches AsyncReadBuffer() or whatever gets an implicit batch of size
one and that's also OK. Not sure what semantics nesting would have
but I doubt it matters much.

> Things that need to be fixed / are fixed in this:
> - max pinned buffers should be limited by io_combine_limit, not * 4
> - overflow distance
> - pins need to be limited in more places

I have patches for these and a few more things and will post in a
separate thread shortly because they can be understood without
reference to this AIO stuff and that'll hopefully be more digestible.

+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ *
+ * TODO: Change the API of StartReadBuffers() to ensure we only ever need
+ * one IO.

Likewise.

+ /* IO finished, but result has not yet been processed */
+ PGAIO_HS_COMPLETED_IO,
+
+ /* IO completed, shared completion has been called */
+ PGAIO_HS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ PGAIO_HS_COMPLETED_LOCAL,

(Repeating something I mentioned in off-list bikeshedding) I wondered
if it might be clearer to use the terminology "terminated" for the
work that PostgreSQL has to do after an I/O completes, instead of
overloading/subdividing the term "completed". We already "terminate"
an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
this feels like a sort of generalisation of that notion. In this AIO
world, some work is done by the backend that receives the completion
notification from the kernel, and some is done by the backend that
submitted the I/O in the first place, a division that doesn't exist
with simple synchronous system calls. I wonder if it would be clearer
to use terms based on those two roles, rather than "shared" and
"local", leading to something like:

PGAIO_HS_COMPLETED,
PGAIO_HS_TERMINATED_BY_COMPLETER,
PGAIO_HS_TERMINATED_BY_SUBMITTER,

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2025-02-10 22:52:17 Re: RFC: Allow EXPLAIN to Output Page Fault Information
Previous Message Michael Paquier 2025-02-10 22:43:09 Re: [PATCH] Optionally record Plan IDs to track plan changes for a query