Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-19 01:50:11
Message-ID: pb3igtyldnc6z7jenl74por3le5nhxqxtk5vxlomwhqwu2y23d@lj2ndybl2zcs
Lists: pgsql-hackers

Hi,

On 2025-03-18 21:00:17 -0400, Melanie Plageman wrote:
> On Tue, Mar 18, 2025 at 4:12 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Attached is v2.10,
>
> This is a review of 0008: bufmgr: Implement AIO read support
>
> I'm afraid it is more of a cosmetic review than a sign-off on the
> patch's correctness, but perhaps it will help future readers who may
> have the same questions I did.

I think that's actually an important level of review. I'm, as odd as that
sounds, more confident about the architectural stuff than about
"understandability" etc.

> In the commit message:
> bufmgr: Implement AIO read support
>
> This implements the following:
> - Addition of callbacks to maintain buffer state on completion of a readv
> - Addition of a wait reference to BufferDesc, to allow backends to wait for
> IOs
> - StartBufferIO(), WaitIO(), TerminateBufferIO() support for waiting AIO
>
> I think it might be nice to say something about allowing backends to
> complete IOs issued by other backends.

Hm, I'd have said that's basically implied by the way AIO works (as outlined
in the added README.md), but I can mention it here too.

> @@ -40,6 +41,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
> CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
>
> CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
> +
> + CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
> +
> + CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
> #undef CALLBACK_ENTRY
> };
>
> I personally can't quite figure out why the read and write callbacks
> are defined differently than the stage, complete, and report
> callbacks. I know there is a comment above PgAioHandleCallbackID about
> something about this, but it didn't really clarify it for me. Maybe
> you can put a block comment at the top of aio_callback.c? Or perhaps I
> just need to study it more...

They're not implemented differently - aio_handle_cbs contains
PgAioHandleCallbacks entries (just with a name added), and each entry has
stage, complete and report callbacks.

E.g. for SHARED_BUFFER_READV you have a stage (to transfer the buffer pins to
the AIO subsystem), a shared completion (to verify the page, to remove
BM_IO_IN_PROGRESS and set BM_VALID/BM_IO_ERROR, as appropriate) and a report
callback (to report a page validation error).

Maybe more of the relevant types and functions should have been plural, but
then it becomes very awkward to talk about the separate registrations of
multiple callbacks (i.e. the set of callbacks for md.c and the set of
callbacks for bufmgr.c).

> @@ -5482,10 +5503,19 @@ WaitIO(BufferDesc *buf)
> + if (pgaio_wref_valid(&iow))
> + {
> + pgaio_wref_wait(&iow);
> + ConditionVariablePrepareToSleep(cv);
> + continue;
> + }
>
> I'd add some comment above this. I reread it many times, and I still
> only _think_ I understand what it does. I think the reason we need
> ConditionVariablePrepareToSleep() again is because pgaio_io_wait() may
> have called ConditionVariableCancelSleep() so we need to
> ConditionVariablePrepareToSleep() again (it was done already at the
> top of Wait())?

Oh, yes, that definitely needs a comment. I've been marinating in this for so
long that it seems obvious, but if I take a step back, it's not at all
obvious.

The issue is that pgaio_wref_wait() internally waits on a *different*
condition variable than the BufferDesc's CV. The consequences of not doing
this would be fairly mild - the next ConditionVariableSleep would prepare to
sleep and return immediately - but it's unnecessary.

> Maybe worth mentioning in the commit message about why WaitIO() has to
> work differently for AIO than sync IO.

K.

> /*
> * Support LockBufferForCleanup()
> *
> * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
> * Most of the time the current backend will hold another pin preventing
> * that from happening, but that's e.g. not the case when completing an IO
> * another backend started.
> */
>
> I found this wording a bit confusing. what about this:
>
> We may have just released the last pin other than the waiter's. In most cases,
> this backend holds another pin on the buffer. But, if, for example, this
> backend is completing an IO issued by another backend, it may be time to wake
> the waiter.

WFM.

> /*
> * Helper for AIO staging callback for both reads and writes as well as temp
> * and shared buffers.
> */
> static pg_attribute_always_inline void
> buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
>
> I think buffer_stage_common() needs the function comment to explain
> what unit it is operating on.
> It is called "buffer_" singular but then it loops through io_data
> which appears to contain multiple buffers

Hm. Yea. Originally it was just for readv and was duplicated for writes. The
vectorized ("v") bit was what hinted at it operating on multiple buffers.

> /*
> * Check that all the buffers are actually ones that could conceivably
> * be done in one IO, i.e. are sequential.
> */
> if (i == 0)
> first = buf_hdr->tag;
> else
> {
> Assert(buf_hdr->tag.relNumber == first.relNumber);
> Assert(buf_hdr->tag.blockNum == first.blockNum + i);
> }
>
> So it is interesting to me that this validation is done at this level.
> Enforcing sequentialness doesn't seem like it would be intrinsically
> related to or required to stage IOs. And there isn't really anything
> in this function that seems like it would require it either. Usually
> an assert is pretty close to the thing it is protecting.

Staging is the last buffer-aware thing that happens before IO is actually
executed. If you were to do a readv() into buffers that aren't for sequential
blocks, you would get bogus buffer pool contents, because obviously it doesn't
make sense to read data for block N+1 into the buffer for block N+3 or
whatnot.

The assertions did find bugs during development, fwiw.

> Oh and I think the end of the loop in buffer_stage_common() would look
> nicer with a small refactor with the resulting code looking like this:
>
> /* temp buffers don't use BM_IO_IN_PROGRESS */
> Assert(!is_temp || (buf_state & BM_IO_IN_PROGRESS));
>
> /* we better have ensured the buffer is present until now */
> Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> /*
> * Reflect that the buffer is now owned by the subsystem.
> *
> * For local buffers: This can't be done just in LocalRefCount as one
> * might initially think, as this backend could error out while AIO is
> * still in progress, releasing all the pins by the backend itself.
> */
> buf_state += BUF_REFCOUNT_ONE;
> buf_hdr->io_wref = io_ref;
>
> if (is_temp)
> {
> pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
> continue;
> }
>
> UnlockBufHdr(buf_hdr, buf_state);
>
> if (is_write)
> {
> LWLock *content_lock;
>
> content_lock = BufferDescriptorGetContentLock(buf_hdr);
>
> Assert(LWLockHeldByMe(content_lock));
>
> /*
> * Lock is now owned by AIO subsystem.
> */
> LWLockDisown(content_lock);
> }
>
> /*
> * Stop tracking this buffer via the resowner - the AIO system now
> * keeps track.
> */
> ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
> }

I don't particularly like this, I'd like to make the logic for shared and
local buffers more similar over time. E.g. by also tracking local buffer IO
via resowner.

> In buffer_readv_complete(), this comment
>
> /*
> * Iterate over all the buffers affected by this IO and call appropriate
> * per-buffer completion function for each buffer.
> */
>
> makes it sound like we might invoke different completion functions (like invoke
> the completion callback), but that isn't what happens here.

Oops, that's how it used to work, but it doesn't anymore, because it ended up
with too much duplication.

> failed =
> prior_result.status == ARS_ERROR
> || prior_result.result <= buf_off;
>
> Though not introduced in this commit, I will say that I find the ARS prefix not
> particularly helpful. Though not as brief, something like AIO_RS_ERROR would
> probably be more clear.

Fair enough. I'd go for PGAIO_RS_ERROR etc though.

> @@ -515,10 +517,25 @@ MarkLocalBufferDirty(Buffer buffer)
> * Like StartBufferIO, but for local buffers
> */
> bool
> -StartLocalBufferIO(BufferDesc *bufHdr, bool forInput)
> +StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait)
> {
> - uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
> + uint32 buf_state;
> +
> + /*
> + * The buffer could have IO in progress, e.g. when there are two scans of
> + * the same relation. Either wait for the other IO or return false.
> + */
> + if (pgaio_wref_valid(&bufHdr->io_wref))
> + {
> + PgAioWaitRef iow = bufHdr->io_wref;
> +
> + if (nowait)
> + return false;
> +
> + pgaio_wref_wait(&iow);
> + }
>
> + buf_state = pg_atomic_read_u32(&bufHdr->state);
> if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY))
> {
> /* someone else already did the I/O */
>
> I'd move this comment ("someone else already did") outside of the if
> statement so it kind of separates it into three clear cases:

FWIW it's inside because that's how StartBufferIO's comment has been for a
fair while...

> 1) the IO is in progress and you can wait on it if you want,
> 2) the IO is completed by someone else (is this possible for local buffers
> without AIO?)

No, that's not possible without AIO.

> 3) you can start the IO

I'll give it a go.

Thanks for the review!

Greetings,

Andres Freund
