Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-25 02:30:27
Message-ID: dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> (We may be due for a test mode that does smgrreleaseall() at every
> CHECK_FOR_INTERRUPTS()?)

I suspect we are. I'm a bit afraid of even trying...

...

It's extremely slow - but at least the main regression as well as the aio tests pass!

> > I however don't particularly like the *start* or *prep* names, I've gone back
> > and forth on those a couple times. I could see "begin" work uniformly across
> > those.
>
> For ease of new readers understanding things, I think it helps for the
> functions that advance PgAioHandleState to have names that use words from
> PgAioHandleState. It's one less mapping to get into the reader's head.

Unfortunately for me it's kind of the opposite in this case, see below.

> "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> learn how to map them to the taxonomy. What reward does the reader get at the
> end of that exercise? I'm not seeing one, but please do tell me what I'm
> missing here.

Because the end state varies, depending on the number of previously staged
IOs, the IO method and whether batchmode is enabled, I think it's better if
the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
*not* aligned with an internal state name. It will just mislead readers to
think that there's a deterministic mapping when that does not exist.

That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
naming that I just stopped seeing.

I'll try to think more about this, perhaps I can make myself see your POV
more.

> > > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
> > > LockBufferForCleanup() should get code like this
> > > ConditionalLockBufferForCleanup() code, either now or when "not possible
> > > today" ends. Currently, it just assumes all local buffers are
> > > cleanup-lockable:
> > >
> > > /* Nobody else to wait for */
> > > if (BufferIsLocal(buffer))
> > > return;
> >
> > Kinda, yes, kinda no? LockBufferForCleanup() assumes, even for shared
> > buffers, that the current backend can't be doing anything that conflicts with
> > acquiring a buffer pin - note that it doesn't check the backend local pincount
> > for shared buffers either.
>
> It checks the local pincount via callee CheckBufferIsPinnedOnce().

In exactly one of the callers :/

> As the patch stands, LockBufferForCleanup() can succeed when
> ConditionalLockBufferForCleanup() would have returned false.

That's already true today, right? In master ConditionalLockBufferForCleanup()
for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.
I think I agree with your suggestion further below, but independent of that, I
don't see how the current modification in the patch makes the worse.

Historically this behaviour of LockBufferForCleanup() kinda somewhat makes
sense - the only place we use LockBufferForCleanup() is in a non-transactional
command i.e. vacuum / index vacuum. So LockBufferForCleanup() turns out to
only be safe in that context.

> Like the comment, I expect it's academic today. I expect it will stay
> academic. Anything that does a cleanup will start by reading the buffer,
> which will resolve any refcnt the AIO subsystems holds for a read. If there's
> an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> that. How about just removing the ConditionalLockBufferForCleanup() changes
> or replacing them with a comment (like the present paragraph)?

I think we'll need an expanded version of what I suggest once we have writes -
but as you say, it shouldn't matter as long as we only have reads. So I think
moving the relevant changes, with adjusted caveats, to the bufmgr: write
change makes sense.

> > > /* ---
> > > * Opt-in to using AIO batchmode.
> > > *
> > > * Submitting IO in larger batches can be more efficient than doing so
> > > * one-by-one, particularly for many small reads. It does, however, require
> > > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > > * a) block without first calling pgaio_submit_staged(), unless a
> > > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > > * never acquired in a nested fashion
> > > * b) directly or indirectly start another batch pgaio_enter_batchmode()
>
> I think a callback could still do:
>
> pgaio_exit_batchmode()
> ... arbitrary code that might reach pgaio_enter_batchmode() ...
> pgaio_enter_batchmode()

Yea - but I somehow doubt there are many cases where it makes sense to
deep-queue IOs within the callback. The cases I can think of are things like
ensuring the right VM buffer is in s_b. But if it turns out to be necessary,
what you seuggest would be an out.

Do you think it's worth mentioning the above workaround? I'm mildly inclined
that not.

If it turns out to be actually useful to do nested batching, we can change it
so that nested batching *is* allowed, that'd not be hard.

> > > *
> > > * As this requires care and is nontrivial in some cases, batching is only
> > > * used with explicit opt-in.
> > > * ---
> > > */
> > > #define READ_STREAM_USE_BATCHING 0x08
> >
> > +1
>
> Agreed. It's simple, and there's no loss of generality.
>
> > I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> > would be better, to highlight that you are making a declaration about
> > a property of your callback, not just turning on an independent
> > go-fast feature... I fished those words out of the main (?)
> > description of this topic atop pgaio_enter_batchmode(). Just a
> > thought, IDK.
>
> Good points. I lean toward your renaming suggestion, or shortening to
> READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK. I'm also fine with the
> original name, though.

I'm ok with all of these. In order of preference:

1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
2) READ_STREAM_BATCHMODE_AWARE
3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2025-03-25 02:38:35 Re: query_id: jumble names of temp tables for better pg_stat_statement UX
Previous Message vignesh C 2025-03-25 02:29:11 Re: [18] CREATE SUBSCRIPTION ... SERVER