Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-24 15:43:47
Message-ID: q2jujruxxksaz7k5m7wckkfz6qvmhiqy2wjwqcgqi3qaz25scb@n7hcd6qf47je
Lists: pgsql-hackers

Hi,

On 2025-03-23 17:29:39 -0700, Noah Misch wrote:
> commit 247ce06b wrote:
> > + pgaio_io_reopen(ioh);
> > +
> > + /*
> > + * To be able to exercise the reopen-fails path, allow injection
> > + * points to trigger a failure at this point.
> > + */
> > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> > +
> > + error_errno = 0;
> > + error_ioh = NULL;
> > +
> > + /*
> > + * We don't expect this to ever fail with ERROR or FATAL, no need
> > + * to keep error_ioh set to the IO.
> > + * pgaio_io_perform_synchronously() contains a critical section to
> > + * ensure we don't accidentally fail.
> > + */
> > + pgaio_io_perform_synchronously(ioh);
>
> A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
> smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that
> interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
> HOLD_INTERRUPTS() should surround the above region of code. It's likely hard
> to reproduce a problem, because pgaio_io_call_inj() does nothing in many
> builds, and pgaio_io_perform_synchronously() starts by entering a critical
> section.

Hm, I guess you're right - it would be pretty bonkers for the injection to
process interrupts, but it's much better to clarify the code to make that not
an option. Once doing that, it seemed a similar assertion in
pgaio_io_before_prep() would be appropriate as well.

> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > Attached v2.11
>
> > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
>
> > +int
> > +FileStartReadV(PgAioHandle *ioh, File file,
> > + int iovcnt, off_t offset,
> > + uint32 wait_event_info)
> > +{
> > + int returnCode;
> > + Vfd *vfdP;
> > +
> > + Assert(FileIsValid(file));
> > +
> > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> > + file, VfdCache[file].fileName,
> > + (int64) offset,
> > + iovcnt));
> > +
> > + returnCode = FileAccess(file);
> > + if (returnCode < 0)
> > + return returnCode;
> > +
> > + vfdP = &VfdCache[file];
> > +
> > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
>
> FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
> w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from
> functions so named. The "start" verb sounds to me like unconditional
> PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like
> the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
> PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names
> FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do
> you see it?

I have a surprisingly strong negative reaction to that proposed naming. To me
the staging is a distinct step that happens *after* the IO is fully
defined. Making all the layered calls that lead up to that named that way
would IMO be a bad idea.

However, I don't particularly like the *start* or *prep* names either; I've
gone back and forth on those a couple of times. I could see "begin" working
uniformly across those.

> > +/*
> > + * AIO error reporting callback for mdstartreadv().
> > + *
> > + * Errors are encoded as follows:
> > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0
>
> I recommend replacing "errno != 0" with either "that errno" or "errno ==
> error_data".

Done.

> > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design
>
> Ready for commit apart from some trivia:

Great.

> > +if (ioret.result.status == PGAIO_RS_ERROR)
> > + pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);
>
> I think ioret and aio_ret are supposed to be the same object. If that's
> right, change one of the names. Likewise elsewhere in this file.

You're right.

> > +The central API piece for postgres' AIO abstraction are AIO handles. To
> > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> > +then "defined", i.e. associate an IO operation with the handle.
>
> s/"defined"/"define" it/ or similar
>
> > +The "solution" to this the ability to associate multiple completion callbacks
>
> s/this the/this is the/

Applied.

> > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
>
> > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> > Assert(refcount > 0);
> > if (refcount != 1)
> > return false;
> > +
> > + /*
> > + * Check that the AIO subsystem doesn't have a pin. Likely not
> > + * possible today, but better safe than sorry.
> > + */
> > + bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> > + buf_state = pg_atomic_read_u32(&bufHdr->state);
> > + refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> > + Assert(refcount > 0);
> > + if (refcount != 1)
> > + return false;
> > +
>
> LockBufferForCleanup() should get code like this
> ConditionalLockBufferForCleanup() code, either now or when "not possible
> today" ends. Currently, it just assumes all local buffers are
> cleanup-lockable:
>
> /* Nobody else to wait for */
> if (BufferIsLocal(buffer))
> return;

Kinda yes, kinda no. LockBufferForCleanup() assumes, even for shared
buffers, that the current backend can't be doing anything that conflicts with
acquiring a buffer pin - note that it doesn't check the backend-local pincount
for shared buffers either.

LockBufferForCleanup() kind of has to make that assumption, because there's no
way to wait for yourself to release another pin, because obviously waiting in
LockBufferForCleanup() would prevent that from ever happening.

It's somewhat disheartening that the comments for LockBufferForCleanup() don't
mention that the caller somehow needs to ensure it isn't called while holding
other pins on the relation. Nor does LockBufferForCleanup() have any asserts
checking how many backend-local pins exist.

Leaving documentation / asserts aside, I think this is largely a safe
assumption given current callers. With one exception, it's all vacuum or
recovery related code - as vacuum can't run in a transaction, we can't
conflict with another pin by the same backend.

The one exception is heap_surgery.c - it doesn't quite seem safe: the
surrounding query (or another query with a cursor) could have a pin on the
target block. The most obvious fix would be to use CheckTableNotInUse(), but that
might break some reasonable uses. Or maybe it should just not use a cleanup
lock, it's not obvious to me why it uses one. But tbh, I don't care too much,
given what heap_surgery is.

> > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
> >
> > buf_state = pg_atomic_read_u32(&bufHdr->state);
> >
> > - if (check_unreferenced && LocalRefCount[bufid] != 0)
> > + /*
> > + * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> > + * itself, as the latter is used to represent a pin by the AIO subsystem.
> > + * This can happen if AIO is initiated and then the query errors out.
> > + */
> > + if (check_unreferenced &&
> > + (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
> > elog(ERROR, "block %u of %s is still referenced (local %u)",
>
> I didn't write a test to prove it, but I'm suspecting we'll reach the above
> ERROR with this sequence:
>
> CREATE TEMP TABLE foo ...;
> [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> DROP TABLE foo;

That seems plausible. I'll try to write a test after this email.

> DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I
> think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> the particular rel) before InvalidateLocalBuffer(). Or use something like the
> logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> corresponding bufmgr code.

Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
me. That path is going to be reached pretty rarely, and waiting for all
concurrent IO seems unnecessarily heavyweight. I don't think it matters much
today, but once we do things like asynchronously writing back buffers or WAL,
the situation will be different.

I think this points to the comment above the WaitIO() in InvalidateBuffer()
needing a bit of adapting - an in-progress read can trigger the WaitIO as
well. Something like:

/*
* We assume the reason for it to be pinned is that either we were
* asynchronously reading the page in before erroring out or someone else
* is flushing the page out. Wait for the IO to finish. (This could be
* an infinite loop if the refcount is messed up... it would be nice to
* time out after awhile, but there seems no way to be sure how many loops
* may be needed. Note that if the other guy has pinned the buffer but
* not yet done StartBufferIO, WaitIO will fall through and we'll
* effectively be busy-looping here.)
*/

> > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > + bool failed, bool is_temp)
> > +{
> ...
> > + PgAioResult result;
> ...
> > + result.status = PGAIO_RS_OK;
> ...
> > + return result;
>
> gcc 14.2.0 -Werror gives me:
>
> bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]

Gngngng. Since when is it a bug for some fields of a struct to be
uninitialized, as long as they're not used?

Interestingly I don't see that warning, despite also using gcc 14.2.0.

I'll just move to your solution, but it seems odd.

> > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO
>
> > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
> > static void
> > read_stream_look_ahead(ReadStream *stream)
> > {
> > + /*
> > + * Allow amortizing the cost of submitting IO over multiple IOs. This
> > + * requires that we don't do any operations that could lead to a deadlock
> > + * with staged-but-unsubmitted IO.
> > + */
> > + pgaio_enter_batchmode();
>
> We call read_stream_get_block() while in batchmode, so the stream callback
> needs to be ready for that. A complicated case is
> collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a
> recipe for corner cases reaching ERROR "starting batch while batch already in
> progress". Are there mitigating factors?

Ugh, yes, you're right. heap_vac_scan_next_block() is also affected.

I don't think "starting batch while batch already in progress" is the real
issue though - it seems easy enough to avoid starting another batch inside,
partially because current cases seem unlikely to need to do batchable IO
inside. What worries me more is that code might block while there's
unsubmitted IO - which seems entirely plausible.

I can see a few approaches:

1) Declare that all read stream callbacks have to be careful and cope with
batch mode

I'm not sure how viable that is; not starting batches seems OK, but
ensuring that the code doesn't block is a different story.

2) Have read stream users opt-in to batching

Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
to implement and to add to the callsites where that's fine.

3) Teach read stream to "look ahead" far enough to determine all the blocks
that could be issued in a batch outside of batchmode

I think that's probably not a great idea, it'd lead us to looking further
ahead than we really need to, which could increase "unfairness" in
e.g. parallel sequential scan.

4) Just defer using batch mode for now

It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
but there's no need to immediately solve this.

I think regardless of which option we go for, it's worth splitting
"aio: Basic read_stream adjustments for real AIO"
into the actually basic parts (i.e. introducing sync_mode) and the not
actually so basic parts (i.e. batching).

I suspect that 2) would be the best approach. Only the read stream user knows
what it needs to do in the callback.

>
> > Subject: [PATCH v2.11 17/27] aio: Add test_aio module
>
> > + # verify that page verification errors are detected even as part of a
> > + # shortened multi-block read (tbl_corr, block 1 is tbl_corred)
>
> Is "tbl_corred" a typo of something?

I think that was a search&replace of the table name gone wrong. It was just
supposed to be "corrupted".

> > + *
> > + * IDENTIFICATION
> > + * src/test/modules/delay_execution/delay_execution.c
>
> Header comment is surviving from copy-paste of delay_execution.c.

Oh, how I hate these pointless comments. Fixed.

Greetings,

Andres Freund
