Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-25 15:26:14
Message-ID: nk5gyjsny6w222roebnj2f7oid3stqrbh4xj5ah3xkvz52borc@hedincoyhg43
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > (We may be due for a test mode that does smgrreleaseall() at every
> > > CHECK_FOR_INTERRUPTS()?)
> >
> > I suspect we are. I'm a bit afraid of even trying...
> >
> > ...
> >
> > It's extremely slow - but at least the main regression as well as the aio tests pass!
>
> One less thing!

Unfortunately I'm now doubting the thoroughness of my check - while I made
every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
log_min_messages <= DEBUGN...

I'll try that in a bit.

> > Because the end state varies, depending on the number of previously staged
> > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > *not* aligned with an internal state name. It will just mislead readers to
> > think that there's a deterministic mapping when that does not exist.
>
> That's fair. Could we provide the mapping in a comment, something like the
> following?

Yes!

I wonder if it should also be duplicated or referenced elsewhere, although I
am not sure where precisely.

> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -34,5 +34,10 @@
> * linearly through all states.
> *
> - * State changes should all go through pgaio_io_update_state().
> + * State changes should all go through pgaio_io_update_state(). Its callers
> + * use these naming conventions:
> + *
> + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> + * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> + * PGAIO_HS_COMPLETED_LOCAL.
> */
> typedef enum PgAioHandleState

One detail I'm not sure about: The above change is correct, but perhaps a bit
misleading, because we can actually go "back" to IDLE. Not sure how to best
phrase that though.

> > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > naming that I just stopped seeing.

I assume you're on board with renaming _io_prep* to _io_start_*?

> > I'll try to think more about this, perhaps I can make myself see your POV
> > more.
>
> > > As the patch stands, LockBufferForCleanup() can succeed when
> > > ConditionalLockBufferForCleanup() would have returned false.
> >
> > That's already true today, right? In master ConditionalLockBufferForCleanup()
> > for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
>
> I'm finding a LocalRefCount check under LockBufferForCleanup:

I guess I should have stopped looking at code / replying before my last email
last night... Not sure how I missed that.

> CheckBufferIsPinnedOnce(Buffer buffer)
> {
> if (BufferIsLocal(buffer))
> {
> if (LocalRefCount[-buffer - 1] != 1)
> elog(ERROR, "incorrect local pin count: %d",
> LocalRefCount[-buffer - 1]);
> }
> else
> {
> if (GetPrivateRefCount(buffer) != 1)
> elog(ERROR, "incorrect local pin count: %d",
> GetPrivateRefCount(buffer));
> }
> }

Pretty random orthogonal thought, that I was reminded of by the above code
snippet:

It sure seems we should at some point get rid of LocalRefCount[] and just use
the GetPrivateRefCount() infrastructure for both shared and local buffers. I
don't think the GetPrivateRefCount() infrastructure cares about
local/non-local, leaving a few asserts aside. If we do that, and start to use
BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
differences between shared and local buffers would be a lot smaller.

> > > Like the comment, I expect it's academic today. I expect it will stay
> > > academic. Anything that does a cleanup will start by reading the buffer,
> > > which will resolve any refcnt the AIO subsystems holds for a read. If there's
> > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > that. How about just removing the ConditionalLockBufferForCleanup() changes
> > > or replacing them with a comment (like the present paragraph)?
> >
> > I think we'll need an expanded version of what I suggest once we have writes -
> > but as you say, it shouldn't matter as long as we only have reads. So I think
> > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > change makes sense.
>
> Moving those changes works for me. I'm not currently seeing the need under
> writes, but that may get clearer upon reaching those patches.

FWIW, I don't think it's currently worth looking at the write side in detail,
there's enough required changes to make that not necessarily the best use of
your time at this point. At least:

- Write logic needs to be rebased ontop of the patch series to not hit bit
dirty buffers while IO is going on

The performance impact of doing the memory copies is rather substantial, as
on intel memory bandwidth is *the* IO bottleneck even just for the checksum
computation, without a copy. That makes the memory copy for something like
bounce buffers hurt really badly.

And the memory usage of bounce buffers is also really concerning.

And even without checksums, several filesystems *really* don't like buffers
getting modified during DIO writes. Which I think would mean we ought to use
bounce buffers for *all* writes, which would impose a *very* substantial
overhead (basically removing the benefit of DMA happening off-cpu).

- Right now the sync.c integration with smgr.c/md.c isn't properly safe to use
in a critical section

The only reason it doesn't immediately fail is that it's reasonably rare
that RegisterSyncRequest() fails *and* either:

- smgropen()->hash_search(HASH_ENTER) decides to resize the hash table, even
though the lookup is guaranteed to suceed for io_method=worker.

- an io_method=uring completion is run in a different backend and smgropen()
needs to build a new entry and thus needs to allocate memory

For a bit I thought this could be worked around easily enough by not doing
an smgropen() in mdsyncfiletag(), or adding a "fallible" smgropen() and
instead just opening the file directly. That actually does kinda solve the
problem, but only because the memory allocation in PathNameOpenFile()
uses malloc(), not palloc() and thus doesn't trigger

- I think it requires new lwlock.c infrastructure (as v1 of aio had), to make
LockBuffer(BUFFER_LOCK_EXCLUSIVE) etc wait in a concurrency safe manner for
in-progress writes

I can think of ways to solve this purely in bufmgr.c, but only in ways that
would cause other problems (e.g. setting BM_IO_IN_PROGRESS before waiting
for an exclusive lock) and/or expensive.

- My current set of patches doesn't implement bgwriter_flush_after,
checkpointer_flush_after

I think that's not too hard to do, it's mainly round tuits.

- temp_file_limit is not respected by aio writes

I guess that could be ok if AIO writes are only used by checkpointer /
bgwriter, but we need to figure out a way to deal with that. Perhaps by
redesigning temp_file_limit, the current implementation seems like rather
substantial layering violation.

- Too much duplicated code, as there's the aio and non-aio write paths. That
might be ok for a bit.

I updated the commit messages of the relevant commits with the above, there
were abbreviated versions of most of the above, but not in enough detail for
anybody but me (and maybe not even that).

> > Do you think it's worth mentioning the above workaround? I'm mildly inclined
> > that not.
>
> Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> exit+reenter is banned. Maybe "(b) start another batch (without first exiting
> one)". It's also fine as-is, though.

I updated it to:

* b) start another batch (without first exiting batchmode and re-entering
* before returning)

> > I'm ok with all of these. In order of preference:
> >
> > 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> > 2) READ_STREAM_BATCHMODE_AWARE
> > 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE
>
> Same for me.

For now I'll leave it at READ_STREAM_USE_BATCHING, but if Thomas has a
preference I'll go for whatever we have a majority for.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2025-03-25 15:33:33 Re: AIO v2.5
Previous Message Sami Imseih 2025-03-25 14:59:34 Re: query_id: jumble names of temp tables for better pg_stat_statement UX