From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> |
Subject: | Re: AIO v2.5 |
Date: | 2025-03-20 18:54:14 |
Message-ID: | 7co5jtx2jkvpg5pde7f5wyub6w6zptfw4sbuj4n5j4iedj5zlg@u3mzxsgopnhl |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> > Hm, we retry more frequently that that if there are new connections... Maybe
> > just "try again next time"?
>
> Works for me.
>
> > > And these individual lines from "git grep BM_IO_IN_PROGRESS":
> > >
> > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
> > >
> > > The last especially.
> >
> > Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
> > 12f3867f5534. I think the comment can just be deleted?
>
> Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.
>
> > > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
> > > * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > > * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> >
> > > For the other three lines and the paragraph, the notion
> > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> > > being the process "doing a read" becomes less significant when one process
> > > starts the IO and another completes it.
> >
> > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
>
> Looking again, I agree they're okay.
>
> >
> > * Now it's safe to write buffer to disk. Note that no one else should
> > * have been able to write it while we were busy with log flushing because
> > * we got the exclusive right to perform I/O by setting the
> > * BM_IO_IN_PROGRESS bit.
>
> That's fine too. Maybe s/perform/stage/ or s/perform/start/.
I put these comment changes into their own patch, as it seemed confusing to
change them as part of one of the already queued commits.
> > > I see this relies on md_readv_complete having converted "result" to blocks.
> > > Was there some win from doing that as opposed to doing the division here?
> > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > > to follow, to me.
> >
> > It seemed like that would be wrong layering - what if we had an smgr that
> > could store data in a compressed format? The raw read would be of a smaller
> > size. The smgr API deals in BlockNumbers, only the md.c layer should know
> > about bytes.
>
> I hadn't thought of that. That's a good reason.
I thought that was better documented, but alas, it wasn't. How about updating
the documentation of smgrstartreadv to the following:
/*
* smgrstartreadv() -- asynchronous version of smgrreadv()
*
* This starts an asynchronous readv IO using the IO handle `ioh`. Other than
* `ioh` all parameters are the same as smgrreadv().
*
* Completion callbacks above smgr will be passed the result as the number of
* successfully read blocks if the read [partially] succeeds. This maintains
* the abstraction that smgr operates on the level of blocks, rather than
* bytes.
*/
I briefly had a bug in test_aio's injection point that lead to *increasing*
the number of bytes successfully read. That triggered an assertion failure in
bufmgr.c, but not closer to the problem. Is it worth adding an assert against
that to md_readv_complete? Can't quite decide.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2025-03-20 19:46:29 | Have postgres.bki self-identify |
Previous Message | Andrei Lepikhov | 2025-03-20 18:54:07 | Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment |