From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-19 22:17:37
Message-ID: cz6eu7prqdibz35nljpqwyzyp37afncsp4r3fk4lkssiecdjck@vwew2c2vs2ar
Lists: pgsql-hackers
Hi,
On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> > - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> > platforms that lack posix_fadvise. But with AIO we can read ahead without
> > posix_fadvise().
> >
> > It'd not really make anything worse than today to not remove the limit, but
> > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need
> > to look around and see whether it would require anything other than doc
> > changes.
>
> Worth changing, but non-blocking.
Thankfully Melanie submitted a patch for that...
> Other than the smgr patch review sent on its own thread, I've not yet reviewed
> any of these patches comprehensively. Given the speed of change, I felt it
> was time to flush comments buffered since 2025-03-11:
Thanks!
> commit 0284401 wrote:
> > aio: Basic subsystem initialization
>
> > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> > */
> > LWLockReleaseAll();
> > pgstat_report_wait_end();
> > + pgaio_error_cleanup();
>
> AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
> WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that
> proper? They do call pgaio_error_cleanup() as seen here, so the only loss is
> some asserts. (The load-bearing part does get done.)
I don't think it's particularly good that we use the AtEOXact_* functions in
the sigsetjmp blocks; that feels like a weird mixup of infrastructure to me.
So this was intentional.
> commit da72269 wrote:
> > aio: Add core asynchronous I/O infrastructure
>
> > + * This could be in aio_internal.h, as it is not pubicly referenced, but
>
> typo -> publicly
/me has a red face.
> commit 55b454d wrote:
> > aio: Infrastructure for io_method=worker
>
> > + /* Try to launch one. */
> > + child = StartChildProcess(B_IO_WORKER);
> > + if (child != NULL)
> > + {
> > + io_worker_children[id] = child;
> > + ++io_worker_count;
> > + }
> > + else
> > + break; /* XXX try again soon? */
>
> I'd change the comment to something like one of:
>
> retry after DetermineSleepTime()
> next LaunchMissingBackgroundProcesses() will retry in <60s
Hm, we retry more frequently than that if there are new connections... Maybe
just "try again next time"?
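For illustration, the shape of the launch loop under discussion (this is an
assumed sketch in Python, not the actual postmaster code; `maintain_io_workers`
and `start_child` are hypothetical names) would retry simply by breaking out
and letting the next maintenance cycle run again:

```python
# Assumed sketch of the worker-launch loop: fill empty slots up to the
# target count; on a failed launch, stop and try again next time the
# maintenance code runs (e.g. after DetermineSleepTime() or on the next
# LaunchMissingBackgroundProcesses()-style cycle).
def maintain_io_workers(io_worker_children, target_count, start_child):
    count = sum(c is not None for c in io_worker_children)
    for slot, child in enumerate(io_worker_children):
        if count >= target_count:
            break
        if child is None:
            new_child = start_child()
            if new_child is None:
                break  # try again next time
            io_worker_children[slot] = new_child
            count += 1
    return count

children = [None, None, None]
assert maintain_io_workers(children, 2, lambda: object()) == 2
```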
> On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > - Decide what to do about the smgr interrupt issue
>
> Replied on that thread. It's essentially ready.
Cool, will reply there in a bit.
> > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
>
> Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph:
>
> * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> buffer to complete (and in releases before 14, it was accompanied by a
> per-buffer LWLock). The process doing a read or write sets the flag for the
> duration, and processes that need to wait for it to be cleared sleep on a
> condition variable.
First draft:
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock). The process starting a read or write sets the flag. When
the I/O is completed, be it by the process that initiated the I/O or by another
process, the flag is removed and the buffer's condition variable is signalled.
Processes that need to wait for the I/O to complete can wait for the
asynchronous I/O using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be
unset by sleeping on the buffer's condition variable.
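The flag-plus-condition-variable protocol that draft describes can be sketched
generically (illustrated in Python with threads standing in for backends; this
is not PostgreSQL code, and the class and method names are invented for the
illustration):

```python
# Generic sketch of the BM_IO_IN_PROGRESS pattern: a flag acts as a kind
# of lock for the duration of an I/O, and waiters sleep on a condition
# variable until whoever completes the I/O - the initiator or another
# process - clears the flag and signals.
import threading

class BufferDesc:
    def __init__(self):
        self.cond = threading.Condition()
        self.io_in_progress = False  # stands in for BM_IO_IN_PROGRESS

    def start_io(self):
        with self.cond:
            assert not self.io_in_progress
            self.io_in_progress = True

    def complete_io(self):
        # May run in the initiating "process" or in any other completer.
        with self.cond:
            self.io_in_progress = False
            self.cond.notify_all()

    def wait_for_io(self):
        with self.cond:
            while self.io_in_progress:
                self.cond.wait()

buf = BufferDesc()
buf.start_io()
threading.Timer(0.05, buf.complete_io).start()  # completion arrives later
buf.wait_for_io()  # returns once the flag has been cleared
```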
> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>
> * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>
> The last especially.
Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
12f3867f5534. I think the comment can just be deleted?
> * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
> * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> For the other three lines and the paragraph, the notion
> of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> being the process "doing a read" becomes less significant when one process
> starts the IO and another completes it.
Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
* we got the exclusive right to perform I/O by setting the
* BM_IO_IN_PROGRESS bit.
> > + /* we better have ensured the buffer is present until now */
> > + Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> I'd delete that comment; to me, the assertion alone is clearer.
Ok.
> > + ereport(LOG,
> > + (errcode(ERRCODE_DATA_CORRUPTED),
> > + errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's
> goals; is it needed? If so, I recommend splitting it out as a preliminary
> patch, to highlight the behavior change for release notes.
No, it's not needed. I think I looked over the patch at some point and
considered the log-level wrong according to our guidelines and thought I'd
broken it.
> > + /*
> > + * If the entire failed on a lower-level, each buffer needs to be
>
> Missing word, probably fix like:
> s,entire failed on a lower-level,entire I/O failed on a lower level,
Yep.
> > + * marked as failed. In case of a partial read, some buffers may be
> > + * ok.
> > + */
> > + failed =
> > + prior_result.status == ARS_ERROR
> > + || prior_result.result <= buf_off;
>
> I didn't run an experiment to check the following, but I think this should be
> s/<=/</. Suppose we requested two blocks and read some amount of bytes
[1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0
> should compute failed=false here, but buf_off==1 should compute failed=true.
Huh, you might be right. I thought I wrote a test for this, I wonder why it
didn't catch the problem...
> I see this relies on md_readv_complete having converted "result" to blocks.
> Was there some win from doing that as opposed to doing the division here?
> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> to follow, to me.
It seemed like that would be wrong layering - what if we had an smgr that
could store data in a compressed format? The raw read would be of a smaller
size. The smgr API deals in BlockNumbers, only the md.c layer should know
about bytes.
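The layering argument can be made concrete with a small sketch (Python for
brevity, not PostgreSQL code; the function names and the compressed-smgr
scenario are hypothetical):

```python
# Only the storage-manager implementation knows how many on-disk bytes
# correspond to one logical block, so it converts the raw byte result to
# a block count before handing it to layer-independent completion code.
BLCKSZ = 8192

def md_bytes_to_blocks(raw_bytes):
    # md.c-style layout: blocks stored uncompressed, one BLCKSZ each
    return raw_bytes // BLCKSZ

def compressed_smgr_bytes_to_blocks(raw_bytes, stored_block_size):
    # hypothetical smgr storing blocks compressed: the byte/block ratio
    # differs, so generic completion code could not do this division
    return raw_bytes // stored_block_size

# a short read of [1*BLCKSZ, 2*BLCKSZ - 1] bytes covers exactly one block
assert md_bytes_to_blocks(2 * BLCKSZ - 1) == 1
```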
> > +
> > + buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> > + is_temp);
> > +
> > + /*
> > + * If there wasn't any prior error and the IO for this page failed in
> > + * some form, set the whole IO's to the page's result.
>
> s/the IO for this page/page verification/
> s/IO's/IO's result/
Agreed.
Thanks for the review!
Greetings,
Andres Freund