Quick Links

Re: AIO v2.5

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject:	Re: AIO v2.5
Date:	2025-03-25 00:45:37
Message-ID:	20250325004537.17.nmisch@google.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Mar 24, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> On 2025-03-23 17:29:39 -0700, Noah Misch wrote:
> > commit 247ce06b wrote:
> > > + pgaio_io_reopen(ioh);
> > > +
> > > + /*
> > > + * To be able to exercise the reopen-fails path, allow injection
> > > + * points to trigger a failure at this point.
> > > + */
> > > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> > > +
> > > + error_errno = 0;
> > > + error_ioh = NULL;
> > > +
> > > + /*
> > > + * We don't expect this to ever fail with ERROR or FATAL, no need
> > > + * to keep error_ioh set to the IO.
> > > + * pgaio_io_perform_synchronously() contains a critical section to
> > > + * ensure we don't accidentally fail.
> > > + */
> > > + pgaio_io_perform_synchronously(ioh);
> >
> > A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
> > smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that
> > interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
> > HOLD_INTERRUPTS() should surround the above region of code. It's likely hard
> > to reproduce a problem, because pgaio_io_call_inj() does nothing in many
> > builds, and pgaio_io_perform_synchronously() starts by entering a critical
> > section.
>
> Hm, I guess you're right - it would be pretty bonkers for the injection to
> process interrupts, but its much better to clarify the code to make that not
> an option. Once doing that it seemed we should also have a similar assertion
> in pgaio_io_before_prep() would be appropriate.

Agreed. Following that line of thinking, the io_uring case needs to
HOLD_INTERRUPTS() (or hold smgrrelease() specifically) all the way from
pgaio_io_before_prep() to PGAIO_HS_SUBMITTED. The fd has to stay valid until
io_uring_submit().

(We may be due for a test mode that does smgrreleaseall() at every
CHECK_FOR_INTERRUPTS()?)

> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
> >
> > > +int
> > > +FileStartReadV(PgAioHandle *ioh, File file,
> > > + int iovcnt, off_t offset,
> > > + uint32 wait_event_info)
> > > +{
> > > + int returnCode;
> > > + Vfd *vfdP;
> > > +
> > > + Assert(FileIsValid(file));
> > > +
> > > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> > > + file, VfdCache[file].fileName,
> > > + (int64) offset,
> > > + iovcnt));
> > > +
> > > + returnCode = FileAccess(file);
> > > + if (returnCode < 0)
> > > + return returnCode;
> > > +
> > > + vfdP = &VfdCache[file];
> > > +
> > > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
> >
> > FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
> > w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from
> > functions so named. The "start" verb sounds to me like unconditional
> > PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like
> > the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
> > PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names
> > FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do
> > you see it?
>
> I have a surprisingly strong negative reaction to that proposed naming. To me
> the staging is a distinct step that happens *after* the IO is fully
> defined. Making all the layered calls that lead up to that named that way
> would IMO be a bad idea.

As a general naming principle, I think the name of a function that advances
through multiple named steps should mention the last step. Naming the
function after just a non-last step feels weird to me. For example, serving a
meal consists of steps menu_define, mix_ingredients, and plate_food. It would
be weird to me if a function called meal_menu_define() mixed ingredients or
plated food, but it's fine if meal_plate_food() does all three steps. A
second strategy is to name both the first and last steps:
meal_define_menu_thru_plate_food() is fine apart from being long. A third
strategy is to have meal_plate_food() assert than meal_mix_ingredients() has
been called.

I wouldn't mind "staging" as a distinct step, but I think today's API
boundaries hide the distinction. PGAIO_HS_DEFINED is a temporary state during
a pgaio_io_stage() call, so the process that defines and stages the IO can
observe PGAIO_HS_DEFINED only while pgaio_io_stage() is on the stack.

The aforementioned "third strategy" could map to having distinct
smgrdefinereadv() and smgrstagereadv(). I don't know how well that would work
out overall. I wouldn't be optimistic about that winning, but I mention it
for completeness.

> I however don't particularly like the *start* or *prep* names, I've gone back
> and forth on those a couple times. I could see "begin" work uniformly across
> those.

For ease of new readers understanding things, I think it helps for the
functions that advance PgAioHandleState to have names that use words from
PgAioHandleState. It's one less mapping to get into the reader's head.
"Begin", "Start" and "prep" are all outside that taxonomy, making the reader
learn how to map them to the taxonomy. What reward does the reader get at the
end of that exercise? I'm not seeing one, but please do tell me what I'm
missing here.

> > > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
> >
> > > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> > > Assert(refcount > 0);
> > > if (refcount != 1)
> > > return false;
> > > +
> > > + /*
> > > + * Check that the AIO subsystem doesn't have a pin. Likely not
> > > + * possible today, but better safe than sorry.
> > > + */
> > > + bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> > > + buf_state = pg_atomic_read_u32(&bufHdr->state);
> > > + refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> > > + Assert(refcount > 0);
> > > + if (refcount != 1)
> > > + return false;
> > > +
> >
> > LockBufferForCleanup() should get code like this
> > ConditionalLockBufferForCleanup() code, either now or when "not possible
> > today" ends. Currently, it just assumes all local buffers are
> > cleanup-lockable:
> >
> > /* Nobody else to wait for */
> > if (BufferIsLocal(buffer))
> > return;
>
> Kinda, yes, kinda no? LockBufferForCleanup() assumes, even for shared
> buffers, that the current backend can't be doing anything that conflicts with
> acquiring a buffer pin - note that it doesn't check the backend local pincount
> for shared buffers either.

It checks the local pincount via callee CheckBufferIsPinnedOnce().

As the patch stands, LockBufferForCleanup() can succeed when
ConditionalLockBufferForCleanup() would have returned false. I'm not seeking
to raise the overall standard of *Cleanup() family of functions, but I am
trying to keep members of that family agreeing on the standard.

Like the comment, I expect it's academic today. I expect it will stay
academic. Anything that does a cleanup will start by reading the buffer,
which will resolve any refcnt the AIO subsystems holds for a read. If there's
an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
that. How about just removing the ConditionalLockBufferForCleanup() changes
or replacing them with a comment (like the present paragraph)?

> I think this points to the comment above the WaitIO() in InvalidateBuffer()
> needing a bit of adapting - an in-progress read can trigger the WaitIO as
> well. Something like:
>
> /*
> * We assume the reason for it to be pinned is that either we were
> * asynchronously reading the page in before erroring out or someone else
> * is flushing the page out. Wait for the IO to finish. (This could be
> * an infinite loop if the refcount is messed up... it would be nice to
> * time out after awhile, but there seems no way to be sure how many loops
> * may be needed. Note that if the other guy has pinned the buffer but
> * not yet done StartBufferIO, WaitIO will fall through and we'll
> * effectively be busy-looping here.)
> */

Agreed.

> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > + bool failed, bool is_temp)
> > > +{
> > ...
> > > + PgAioResult result;
> > ...
> > > + result.status = PGAIO_RS_OK;
> > ...
> > > + return result;
> >
> > gcc 14.2.0 -Werror gives me:
> >
> > bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]
>
> Gngngng. Since when is it a bug for some fields of a struct to be
> uninitialized, as long they're not used?
>
> Interestingly I don't see that warning, despite also using gcc 14.2.0.

I badly neglected to mention my non-default flags:

CFLAGS='-O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined --param=max-vartrack-size=150000000 -ftrivial-auto-var-init=pattern'
COPT=-Werror -Wno-error=array-bounds

Final CFLAGS, including the ones "configure" elects on its own:

configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-format-truncation -Wno-cast-function-type-strict -g -O2 -fno-sanitize-recover=all -fsanitize=address,alignment,undefined --param=max-vartrack-size=150000000 -ftrivial-auto-var-init=pattern

(I use -Wno-error=array-bounds because the sanitizer options elicit a lot of
those warnings. Today's master is free from maybe-uninitialized warnings in
this configuration, though.)

> I'll just move to your solution, but it seems odd.

Got it.

> I think regardless of what we go for, it's worth splitting
> "aio: Basic read_stream adjustments for real AIO"
> into the actually basic parts (i.e. introducing sync_mode) from the not
> actually so basic parts (i.e. batching).

Fair.

On Mon, Mar 24, 2025 at 06:55:22PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-24 11:43:47 -0400, Andres Freund wrote:
> > > I didn't write a test to prove it, but I'm suspecting we'll reach the above
> > > ERROR with this sequence:
> > >
> > > CREATE TEMP TABLE foo ...;
> > > [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> > > DROP TABLE foo;
> >
> > That seems plausible. I'll try to write a test after this email.
>
> FWIW, a test did indeed confirm that. Luckily:
>
> > > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I
> > > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> > > the particular rel) before InvalidateLocalBuffer(). Or use something like the
> > > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> > > corresponding bufmgr code.
> >
> > Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
> > me.
>
> This did indeed resolve the issue.

I'm happy with that approach.

On Tue, Mar 25, 2025 at 01:07:49PM +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 11:55 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > If a callback may sometimes need to block, it can still opt into
> > READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking.
> >
> > The hardest part is to explain the flag. Here's my current attempt:
> >
> > /* ---
> > * Opt-in to using AIO batchmode.
> > *
> > * Submitting IO in larger batches can be more efficient than doing so
> > * one-by-one, particularly for many small reads. It does, however, require
> > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > * a) block without first calling pgaio_submit_staged(), unless a
> > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > * never acquired in a nested fashion
> > * b) directly or indirectly start another batch pgaio_enter_batchmode()

I think a callback could still do:

pgaio_exit_batchmode()
... arbitrary code that might reach pgaio_enter_batchmode() ...
pgaio_enter_batchmode()

> > *
> > * As this requires care and is nontrivial in some cases, batching is only
> > * used with explicit opt-in.
> > * ---
> > */
> > #define READ_STREAM_USE_BATCHING 0x08
>
> +1

Agreed. It's simple, and there's no loss of generality.

> I wonder if something more like READ_STREAM_CALLBACK_BATCHMODE_AWARE
> would be better, to highlight that you are making a declaration about
> a property of your callback, not just turning on an independent
> go-fast feature... I fished those words out of the main (?)
> description of this topic atop pgaio_enter_batchmode(). Just a
> thought, IDK.

Good points. I lean toward your renaming suggestion, or shortening to
READ_STREAM_BATCHMODE_AWARE or READ_STREAM_BATCH_OK. I'm also fine with the
original name, though.

Thanks,
nm

In response to

Re: AIO v2.5 at 2025-03-25 00:07:49 from Thomas Munro

Responses

Re: AIO v2.5 at 2025-03-25 02:30:27 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2025-03-25 01:18:06	Re: AIO v2.5
Previous Message	Masahiko Sawada	2025-03-25 00:45:10	Re: Make COPY format extendable: Extract COPY TO format implementations