Quick Links

Re: AIO v2.5

From:	Melanie Plageman <melanieplageman(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Antonin Houska <ah(at)cybertec(dot)at>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject:	Re: AIO v2.5
Date:	2025-03-19 01:00:17
Message-ID:	CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Attached is v2.10,

This is a review of 0008: bufmgr: Implement AIO read support

I'm afraid it is more of a cosmetic review than a sign-off on the
patch's correctness, but perhaps it will help future readers who may
have the same questions I did.

In the commit message:
bufmgr: Implement AIO read support

This implements the following:
- Addition of callbacks to maintain buffer state on completion of a readv
- Addition of a wait reference to BufferDesc, to allow backends to wait for
IOs
- StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO

I think it might be nice to say something about allowing backends to
complete IOs issued by other backends.

@@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),

CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
#undef CALLBACK_ENTRY
};

I personally can't quite figure out why the read and write callbacks
are defined differently than the stage, complete, and report
callbacks. I know there is a comment above PgAioHandleCallbackID about
something about this, but it didn't really clarify it for me. Maybe
you can put a block comment at the top of aio_callback.c? Or perhaps I
just need to study it more...

@@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf)
+ if (pgaio_wref_valid(&iow))
+ {
+ pgaio_wref_wait(&iow);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }

I'd add some comment above this. I reread it many times, and I still
only _think_ I understand what it does. I think the reason we need
ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may
have called ConditionVariableCancelSleep() so we need to
ConditionVariablePrepareToSleep() again (it was done already at the
top of Wait())?

I'll admit I find the CV API quite confusing, so maybe I'm just
misunderstanding it.

Maybe worth mentioning in the commit message about why WaitIO() has to
work differently for AIO than sync IO.

/*
* Support LockBufferForCleanup()
*
* If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
* Most of the time the current backend will hold another pin preventing
* that from happening, but that's e.g. not the case when completing an IO
* another backend started.
*/

I found this wording a bit confusing. what about this:

We may have just released the last pin other than the waiter's. In most cases,
this backend holds another pin on the buffer. But, if, for example, this
backend is completing an IO issued by another backend, it may be time to wake
the waiter.

/*
* Helper for AIO staging callback for both reads and writes as well as temp
* and shared buffers.
*/
static pg_attribute_always_inline void
buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)

I think buffer_stage_common() needs the function comment to explain
what unit it is operating on.
It is called "buffer_" singular but then it loops through io_data
which appears to contain multiple buffers

/*
* Check that all the buffers are actually ones that could conceivably
* be done in one IO, i.e. are sequential.
*/
if (i == 0)
first = buf_hdr->tag;
else
{
Assert(buf_hdr->tag.relNumber == first.relNumber);
Assert(buf_hdr->tag.blockNum == first.blockNum + i);
}

So it is interesting to me that this validation is done at this level.
Enforcing sequentialness doesn't seem like it would be intrinsically
related to or required to stage IOs. And there isn't really anything
in this function that seems like it would require it either. Usually
an assert is pretty close to the thing it is protecting.

Oh and I think the end of the loop in buffer_stage_common() would look
nicer with a small refactor with the resulting code looking like this:

/* temp buffers don't use BM_IO_IN_PROGRESS */
Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS));

/* we better have ensured the buffer is present until now */
Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);

/*
* Reflect that the buffer is now owned by the subsystem.
*
* For local buffers: This can't be done just in LocalRefCount as one
* might initially think, as this backend could error out while AIO is
* still in progress, releasing all the pins by the backend itself.
*/
buf_state += BUF_REFCOUNT_ONE;
buf_hdr->io_wref = io_ref;

if (is_temp)
{
pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
continue;
}

UnlockBufHdr(buf_hdr, buf_state);

if (is_write)
{
LWLock *content_lock;

content_lock = BufferDescriptorGetContentLock(buf_hdr);

Assert(LWLockHeldByMe(content_lock));

/*
* Lock is now owned by AIO subsystem.
*/
LWLockDisown(content_lock);
}

/*
* Stop tracking this buffer via the resowner - the AIO system now
* keeps track.
*/
ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
}

In buffer_readv_complete(), this comment

/*
* Iterate over all the buffers affected by this IO and call appropriate
* per-buffer completion function for each buffer.
*/

makes it sound like we might invoke different completion functions (like invoke
the completion callback), but that isn't what happens here.

failed =
prior_result.status == ARS_ERROR
|| prior_result.result <= buf_off;

Though not introduced in this commit, I will say that I find the ARS prefix not
particularly helpful. Though not as brief, something like AIO_RS_ERROR would
probably be more clear.

@@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer)
* Like StartBufferIO, but for local buffers
*/
bool
-StartLocalBufferIO(BufferDesc *bufHdr, bool forInput)
+StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
{
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ uint32 buf_state;
+
+ /*
+ * The buffer could have IO in progress, e.g. when there are two scans of
+ * the same relation. Either wait for the other IO or return false.
+ */
+ if (pgaio_wref_valid(&bufHdr->io_wref))
+ {
+ PgAioWaitRef iow = bufHdr->io_wref;
+
+ if (nowait)
+ return false;
+
+ pgaio_wref_wait(&iow);
+ }

+ buf_state = pg_atomic_read_u32(&bufHdr->state);
if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
{
/* someone else already did the I/O */

I'd move this comment ("someone else already did") outside of the if
statement so it kind of separates it into three clear cases:
1) the IO is in progress and you can wait on it if you want, 2) the IO
is completed by someone else (is this possible for local buffers
without AIO?) 3) you can start the IO

- Melanie

In response to

Re: AIO v2.5 at 2025-03-18 20:12:18 from Andres Freund

Responses

Re: AIO v2.5 at 2025-03-19 01:50:11 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	jian he	2025-03-19 01:19:42	Re: add function argument name to substring and substr
Previous Message	Dean	2025-03-19 00:56:04	Re: Proposal: Deferred Replica Filtering for PostgreSQL Logical Replication