Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-20 18:54:14
Message-ID: 7co5jtx2jkvpg5pde7f5wyub6w6zptfw4sbuj4n5j4iedj5zlg@u3mzxsgopnhl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:

> > Hm, we retry more frequently that that if there are new connections... Maybe
> > just "try again next time"?
>
> Works for me.
>

> > > And these individual lines from "git grep BM_IO_IN_PROGRESS":
> > >
> > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
> > >
> > > The last especially.
> >
> > Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
> > 12f3867f5534. I think the comment can just be deleted?
>
> Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.
>
> > > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
> > > * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > > * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> >
> > > For the other three lines and the paragraph, the notion
> > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> > > being the process "doing a read" becomes less significant when one process
> > > starts the IO and another completes it.
> >
> > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
>
> Looking again, I agree they're okay.
>
> >
> > * Now it's safe to write buffer to disk. Note that no one else should
> > * have been able to write it while we were busy with log flushing because
> > * we got the exclusive right to perform I/O by setting the
> > * BM_IO_IN_PROGRESS bit.
>
> That's fine too. Maybe s/perform/stage/ or s/perform/start/.

I put these comment changes into their own patch, as it seemed confusing to
change them as part of one of the already queued commits.

> > > I see this relies on md_readv_complete having converted "result" to blocks.
> > > Was there some win from doing that as opposed to doing the division here?
> > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > > to follow, to me.
> >
> > It seemed like that would be wrong layering - what if we had an smgr that
> > could store data in a compressed format? The raw read would be of a smaller
> > size. The smgr API deals in BlockNumbers, only the md.c layer should know
> > about bytes.
>
> I hadn't thought of that. That's a good reason.

I thought that was better documented, but alas, it wasn't. How about updating
the documentation of smgrstartreadv to the following:

/*
* smgrstartreadv() -- asynchronous version of smgrreadv()
*
* This starts an asynchronous readv IO using the IO handle `ioh`. Other than
* `ioh` all parameters are the same as smgrreadv().
*
* Completion callbacks above smgr will be passed the result as the number of
* successfully read blocks if the read [partially] succeeds. This maintains
* the abstraction that smgr operates on the level of blocks, rather than
* bytes.
*/

I briefly had a bug in test_aio's injection point that lead to *increasing*
the number of bytes successfully read. That triggered an assertion failure in
bufmgr.c, but not closer to the problem. Is it worth adding an assert against
that to md_readv_complete? Can't quite decide.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2025-03-20 19:46:29 Have postgres.bki self-identify
Previous Message Andrei Lepikhov 2025-03-20 18:54:07 Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment