Re: AIO v2.5

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-25 15:58:08
Message-ID: 20250325155808.f7.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 25, 2025 at 11:26:14AM -0400, Andres Freund wrote:
> On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > > (We may be due for a test mode that does smgrreleaseall() at every
> > > > CHECK_FOR_INTERRUPTS()?)
> > >
> > > I suspect we are. I'm a bit afraid of even trying...
> > >
> > > ...
> > >
> > > It's extremely slow - but at least the main regression as well as the aio tests pass!
> >
> > One less thing!
>
> Unfortunately I'm now doubting the thoroughness of my check - while I made
> every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
> trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
> log_min_messages <= DEBUGN...
>
> I'll try that in a bit.

While having nagging thoughts that we might be releasing FDs before io_uring
gets them into kernel custody, I tried this hack to maximize FD turnover:

static void
ReleaseLruFiles(void)
{
#if 0
while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
{
if (!ReleaseLruFile())
break;
}
#else
while (ReleaseLruFile())
;
#endif
}

"make check" with default settings (io_method=worker) passes, but
io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
runs. s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
Here's the richer of the two diffs:

diff -U3 src/test/regress/expected/sanity_check.out src/test/regress/results/sanity_check.out
--- src/test/regress/expected/sanity_check.out 2024-10-24 12:43:25.741817594 -0700
+++ src/test/regress/results/sanity_check.out 2025-03-25 08:27:44.875151566 -0700
@@ -1,4 +1,7 @@
VACUUM;
+ERROR: index "pg_enum_oid_index" contains corrupted page at block 2
+HINT: Please REINDEX it.
+CONTEXT: while vacuuming index "pg_enum_oid_index" of relation "pg_catalog.pg_enum"
--
-- Sanity check: every system catalog that has OIDs should have
-- a unique index on OID. This ensures that the OIDs will be unique,
diff -U3 src/test/regress/expected/oidjoins.out src/test/regress/results/oidjoins.out
--- src/test/regress/expected/oidjoins.out 2023-07-06 19:58:07.686364439 -0700
+++ src/test/regress/results/oidjoins.out 2025-03-25 08:28:02.584335458 -0700
@@ -233,6 +233,8 @@
NOTICE: checking pg_policy {polrelid} => pg_class {oid}
NOTICE: checking pg_policy {polroles} => pg_authid {oid}
NOTICE: checking pg_default_acl {defaclrole} => pg_authid {oid}
+WARNING: FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,5)",0)
+WARNING: FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,7)",402654464)
NOTICE: checking pg_default_acl {defaclnamespace} => pg_namespace {oid}
NOTICE: checking pg_init_privs {classoid} => pg_class {oid}
NOTICE: checking pg_seclabel {classoid} => pg_class {oid}

> > > Because the end state varies, depending on the number of previously staged
> > > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > > *not* aligned with an internal state name. It will just mislead readers to
> > > think that there's a deterministic mapping when that does not exist.
> >
> > That's fair. Could we provide the mapping in a comment, something like the
> > following?
>
> Yes!
>
> I wonder if it should also be duplicated or referenced elsewhere, although I
> am not sure where precisely.

I considered the README.md also, but adding that wasn't an obvious win.

> > --- a/src/include/storage/aio_internal.h
> > +++ b/src/include/storage/aio_internal.h
> > @@ -34,5 +34,10 @@
> > * linearly through all states.
> > *
> > - * State changes should all go through pgaio_io_update_state().
> > + * State changes should all go through pgaio_io_update_state(). Its callers
> > + * use these naming conventions:
> > + *
> > + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> > + * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> > + * PGAIO_HS_COMPLETED_LOCAL.
> > */
> > typedef enum PgAioHandleState
>
> One detail I'm not sure about: The above change is correct, but perhaps a bit
> misleading, because we can actually go "back" to IDLE. Not sure how to best
> phrase that though.

Not sure either. Maybe the above could change to "to PGAIO_HS_STAGED or any
subsequent state" and the comment at PGAIO_HS_STAGED could say like "Once in
this state, concurrent activity could move the IO all the way to
PGAIO_HS_COMPLETED_LOCAL and recycle it back to IDLE."

> > > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > > naming that I just stopped seeing.
>
> I assume you're on board with renaming _io_prep* to _io_start_*?

Yes.

> > > I'll try to think more about this, perhaps I can make myself see your POV
> > > more.

> > CheckBufferIsPinnedOnce(Buffer buffer)
> > {
> > if (BufferIsLocal(buffer))
> > {
> > if (LocalRefCount[-buffer - 1] != 1)
> > elog(ERROR, "incorrect local pin count: %d",
> > LocalRefCount[-buffer - 1]);
> > }
> > else
> > {
> > if (GetPrivateRefCount(buffer) != 1)
> > elog(ERROR, "incorrect local pin count: %d",
> > GetPrivateRefCount(buffer));
> > }
> > }
>
> Pretty random orthogonal thought, that I was reminded of by the above code
> snippet:
>
> It sure seems we should at some point get rid of LocalRefCount[] and just use
> the GetPrivateRefCount() infrastructure for both shared and local buffers. I
> don't think the GetPrivateRefCount() infrastructure cares about
> local/non-local, leaving a few asserts aside. If we do that, and start to use
> BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
> differences between shared and local buffers would be a lot smaller.

That sounds promising.

> > > > Like the comment, I expect it's academic today. I expect it will stay
> > > > academic. Anything that does a cleanup will start by reading the buffer,
> > > > which will resolve any refcnt the AIO subsystems holds for a read. If there's
> > > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > > that. How about just removing the ConditionalLockBufferForCleanup() changes
> > > > or replacing them with a comment (like the present paragraph)?
> > >
> > > I think we'll need an expanded version of what I suggest once we have writes -
> > > but as you say, it shouldn't matter as long as we only have reads. So I think
> > > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > > change makes sense.
> >
> > Moving those changes works for me. I'm not currently seeing the need under
> > writes, but that may get clearer upon reaching those patches.
>
> FWIW, I don't think it's currently worth looking at the write side in detail,

Got it. (I meant I didn't see a first-principles need, not that I had deduced
lack of need from a specific writes implementation.)

> > > Do you think it's worth mentioning the above workaround? I'm mildly inclined
> > > that not.
> >
> > Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> > exit+reenter is banned. Maybe "(b) start another batch (without first exiting
> > one)". It's also fine as-is, though.
>
> I updated it to:
>
> * b) start another batch (without first exiting batchmode and re-entering
> * before returning)

That's good.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Álvaro Herrera 2025-03-25 16:02:49 simplifying grammar for ALTER CONSTRAINT .. SET [NO] INHERIT
Previous Message Andres Freund 2025-03-25 15:57:58 Re: AIO v2.5