From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> |
Subject: | Re: AIO v2.5 |
Date: | 2025-03-25 13:33:21 |
Message-ID: | 20250325133321.54.nmisch@google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > (We may be due for a test mode that does smgrreleaseall() at every
> > CHECK_FOR_INTERRUPTS()?)
>
> I suspect we are. I'm a bit afraid of even trying...
>
> ...
>
> It's extremely slow - but at least the main regression as well as the aio tests pass!
One less thing!
> > > I however don't particularly like the *start* or *prep* names, I've gone back
> > > and forth on those a couple times. I could see "begin" work uniformly across
> > > those.
> >
> > For ease of new readers understanding things, I think it helps for the
> > functions that advance PgAioHandleState to have names that use words from
> > PgAioHandleState. It's one less mapping to get into the reader's head.
>
> Unfortunately for me it's kind of the opposite in this case, see below.
>
>
> > "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> > learn how to map them to the taxonomy. What reward does the reader get at the
> > end of that exercise? I'm not seeing one, but please do tell me what I'm
> > missing here.
>
> Because the end state varies, depending on the number of previously staged
> IOs, the IO method and whether batchmode is enabled, I think it's better if
> the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> *not* aligned with an internal state name. It will just mislead readers to
> think that there's a deterministic mapping when that does not exist.
That's fair. Could we provide the mapping in a comment, something like the
following?
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -34,5 +34,10 @@
* linearly through all states.
*
- * State changes should all go through pgaio_io_update_state().
+ * State changes should all go through pgaio_io_update_state(). Its callers
+ * use these naming conventions:
+ *
+ * - A "start" function (e.g. FileStartReadV()) moves an IO from
+ * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
+ * PGAIO_HS_COMPLETED_LOCAL.
*/
typedef enum PgAioHandleState
> That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> naming that I just stopped seeing.
>
> I'll try to think more about this, perhaps I can make myself see your POV
> more.
> > As the patch stands, LockBufferForCleanup() can succeed when
> > ConditionalLockBufferForCleanup() would have returned false.
>
> That's already true today, right? In master ConditionalLockBufferForCleanup()
> for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
I'm finding a LocalRefCount check under LockBufferForCleanup:
LockBufferForCleanup(Buffer buffer)
{
...
CheckBufferIsPinnedOnce(buffer);
CheckBufferIsPinnedOnce(Buffer buffer)
{
if (BufferIsLocal(buffer))
{
if (LocalRefCount[-buffer - 1] != 1)
elog(ERROR, "incorrect local pin count: %d",
LocalRefCount[-buffer - 1]);
}
else
{
if (GetPrivateRefCount(buffer) != 1)
elog(ERROR, "incorrect local pin count: %d",
GetPrivateRefCount(buffer));
}
}
> > Like the comment, I expect it's academic today. I expect it will stay
> > academic. Anything that does a cleanup will start by reading the buffer,
> > which will resolve any refcnt the AIO subsystems holds for a read. If there's
> > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > that. How about just removing the ConditionalLockBufferForCleanup() changes
> > or replacing them with a comment (like the present paragraph)?
>
> I think we'll need an expanded version of what I suggest once we have writes -
> but as you say, it shouldn't matter as long as we only have reads. So I think
> moving the relevant changes, with adjusted caveats, to the bufmgr: write
> change makes sense.
Moving those changes works for me. I'm not currently seeing the need under
writes, but that may get clearer upon reaching those patches.
> > > > /* ---
> > > > * Opt-in to using AIO batchmode.
> > > > *
> > > > * Submitting IO in larger batches can be more efficient than doing so
> > > > * one-by-one, particularly for many small reads. It does, however, require
> > > > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > > > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > > > * a) block without first calling pgaio_submit_staged(), unless a
> > > > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > > > * never acquired in a nested fashion
> > > > * b) directly or indirectly start another batch pgaio_enter_batchmode()
> >
> > I think a callback could still do:
> >
> > pgaio_exit_batchmode()
> > ... arbitrary code that might reach pgaio_enter_batchmode() ...
> > pgaio_enter_batchmode()
>
> Yea - but I somehow doubt there are many cases where it makes sense to
> deep-queue IOs within the callback. The cases I can think of are things like
> ensuring the right VM buffer is in s_b. But if it turns out to be necessary,
> what you seuggest would be an out.
I don't foresee a callback specifically wanting to batch, but callbacks might
call into other infrastructure that can elect to batch. The exit+reenter
pattern would be better than adding no-batch options to other infrastructure.
> Do you think it's worth mentioning the above workaround? I'm mildly inclined
> that not.
Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
exit+reenter is banned. Maybe "(b) start another batch (without first exiting
one)". It's also fine as-is, though.
> If it turns out to be actually useful to do nested batching, we can change it
> so that nested batching *is* allowed, that'd not be hard.
Good point.
> I'm ok with all of these. In order of preference:
>
> 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> 2) READ_STREAM_BATCHMODE_AWARE
> 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE
Same for me.
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2025-03-25 13:39:17 | Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints |
Previous Message | John Naylor | 2025-03-25 13:18:41 | Re: Improve CRC32C performance on SSE4.2 |