From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> |
Subject: | Re: AIO v2.5 |
Date: | 2025-03-24 00:29:39 |
Message-ID: | 20250324002939.5c.nmisch@google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
commit 247ce06b wrote:
> + pgaio_io_reopen(ioh);
> +
> + /*
> + * To be able to exercise the reopen-fails path, allow injection
> + * points to trigger a failure at this point.
> + */
> + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> +
> + error_errno = 0;
> + error_ioh = NULL;
> +
> + /*
> + * We don't expect this to ever fail with ERROR or FATAL, no need
> + * to keep error_ioh set to the IO.
> + * pgaio_io_perform_synchronously() contains a critical section to
> + * ensure we don't accidentally fail.
> + */
> + pgaio_io_perform_synchronously(ioh);
A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that
interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
HOLD_INTERRUPTS() should surround the above region of code. It's likely hard
to reproduce a problem, because pgaio_io_call_inj() does nothing in many
builds, and pgaio_io_perform_synchronously() starts by entering a critical
section.
On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11
> Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
> +int
> +FileStartReadV(PgAioHandle *ioh, File file,
> + int iovcnt, off_t offset,
> + uint32 wait_event_info)
> +{
> + int returnCode;
> + Vfd *vfdP;
> +
> + Assert(FileIsValid(file));
> +
> + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> + file, VfdCache[file].fileName,
> + (int64) offset,
> + iovcnt));
> +
> + returnCode = FileAccess(file);
> + if (returnCode < 0)
> + return returnCode;
> +
> + vfdP = &VfdCache[file];
> +
> + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from
functions so named. The "start" verb sounds to me like unconditional
PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like
the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names
FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do
you see it?
> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0
I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".
> Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design
Ready for commit apart from some trivia:
> +if (ioret.result.status == PGAIO_RS_ERROR)
> + pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);
I think ioret and aio_ret are supposed to be the same object. If that's
right, change one of the names. Likewise elsewhere in this file.
> +The central API piece for postgres' AIO abstraction are AIO handles. To
> +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> +then "defined", i.e. associate an IO operation with the handle.
s/"defined"/"define" it/ or similar
> +The "solution" to this the ability to associate multiple completion callbacks
s/this the/this is the/
> Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
> @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> Assert(refcount > 0);
> if (refcount != 1)
> return false;
> +
> + /*
> + * Check that the AIO subsystem doesn't have a pin. Likely not
> + * possible today, but better safe than sorry.
> + */
> + bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> + buf_state = pg_atomic_read_u32(&bufHdr->state);
> + refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> + Assert(refcount > 0);
> + if (refcount != 1)
> + return false;
> +
LockBufferForCleanup() should get code like this
ConditionalLockBufferForCleanup() code, either now or when "not possible
today" ends. Currently, it just assumes all local buffers are
cleanup-lockable:
/* Nobody else to wait for */
if (BufferIsLocal(buffer))
return;
> @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
>
> buf_state = pg_atomic_read_u32(&bufHdr->state);
>
> - if (check_unreferenced && LocalRefCount[bufid] != 0)
> + /*
> + * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> + * itself, as the latter is used to represent a pin by the AIO subsystem.
> + * This can happen if AIO is initiated and then the query errors out.
> + */
> + if (check_unreferenced &&
> + (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
> elog(ERROR, "block %u of %s is still referenced (local %u)",
I didn't write a test to prove it, but I'm suspecting we'll reach the above
ERROR with this sequence:
CREATE TEMP TABLE foo ...;
[some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
DROP TABLE foo;
DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I
think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
the particular rel) before InvalidateLocalBuffer(). Or use something like the
logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
corresponding bufmgr code. I think that bufmgr ERROR is unreachable, since
only a private refcnt triggers that bufmgr ERROR. Is there something
preventing the localbuf error from being a problem? (This wouldn't require
changes to the current patch; responsibility would fall in a bufmgr AIO
patch.)
> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support
(Still reviewing this and later patches, but incidental observations follow.)
> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> + bool failed, bool is_temp)
> +{
...
> + PgAioResult result;
...
> + result.status = PGAIO_RS_OK;
...
> + return result;
gcc 14.2.0 -Werror gives me:
bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]
Zeroing the unset fields silenced it:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
char *bufdata = BufferGetBlock(buffer);
- PgAioResult result;
+ PgAioResult result = { .status = PGAIO_RS_OK };
uint32 set_flag_bits;
@@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
- result.status = PGAIO_RS_OK;
-
/* check for garbage data */
> Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO
> @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
> static void
> read_stream_look_ahead(ReadStream *stream)
> {
> + /*
> + * Allow amortizing the cost of submitting IO over multiple IOs. This
> + * requires that we don't do any operations that could lead to a deadlock
> + * with staged-but-unsubmitted IO.
> + */
> + pgaio_enter_batchmode();
We call read_stream_get_block() while in batchmode, so the stream callback
needs to be ready for that. A complicated case is
collect_corrupt_items_read_stream_next_block(), which may do its own buffer
I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a
recipe for corner cases reaching ERROR "starting batch while batch already in
progress". Are there mitigating factors?
> Subject: [PATCH v2.11 17/27] aio: Add test_aio module
> + # verify that page verification errors are detected even as part of a
> + # shortened multi-block read (tbl_corr, block 1 is tbl_corred)
Is "tbl_corred" a typo of something?
> --- /dev/null
> +++ b/src/test/modules/test_aio/test_aio.c
> @@ -0,0 +1,657 @@
> +/*-------------------------------------------------------------------------
> + *
> + * delay_execution.c
> + * Test module to allow delay between parsing and execution of a query.
> + *
> + * The delay is implemented by taking and immediately releasing a specified
> + * advisory lock. If another process has previously taken that lock, the
> + * current process will be blocked until the lock is released; otherwise,
> + * there's no effect. This allows an isolationtester script to reliably
> + * test behaviors where some specified action happens in another backend
> + * between parsing and execution of any desired query.
> + *
> + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + * src/test/modules/delay_execution/delay_execution.c
Header comment is surviving from copy-paste of delay_execution.c.
> + * Tor tests we don't want the resowner release preventing us from
s/Tor/For/
From | Date | Subject | |
---|---|---|---|
Next Message | Richard Guo | 2025-03-24 00:50:36 | Re: Fix infinite loop from setting scram_iterations to INT_MAX |
Previous Message | Michael Paquier | 2025-03-23 23:52:33 | Re: Proposal - Allow extensions to set a Plan Identifier |