Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-03-19 22:17:37
Message-ID: cz6eu7prqdibz35nljpqwyzyp37afncsp4r3fk4lkssiecdjck@vwew2c2vs2ar
Lists: pgsql-hackers

Hi,

On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> > - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> > platforms that lack posix_fadvise. But with AIO we can read ahead without
> > posix_fadvise().
> >
> > It'd not really make anything worse than today to not remove the limit, but
> > it'd be pretty weird to prevent windows etc from benefiting from AIO. Need
> > to look around and see whether it would require anything other than doc
> > changes.
>
> Worth changing, but non-blocking.

Thankfully Melanie submitted a patch for that...

> Other than the smgr patch review sent on its own thread, I've not yet reviewed
> any of these patches comprehensively. Given the speed of change, I felt it
> was time to flush comments buffered since 2025-03-11:

Thanks!

> commit 0284401 wrote:
> > aio: Basic subsystem initialization
>
> > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> > */
> > LWLockReleaseAll();
> > pgstat_report_wait_end();
> > + pgaio_error_cleanup();
>
> AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
> WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that
> proper? They do call pgaio_error_cleanup() as seen here, so the only loss is
> some asserts. (The load-bearing part does get done.)

I don't think it's particularly good that we use the AtEOXact_* functions in
the sigsetjmp blocks, that feels like a weird mixup of infrastructure to
me. So this was intentional.

> commit da72269 wrote:
> > aio: Add core asynchronous I/O infrastructure
>
> > + * This could be in aio_internal.h, as it is not pubicly referenced, but
>
> typo -> publicly

/me has a red face.

> commit 55b454d wrote:
> > aio: Infrastructure for io_method=worker
>
> > + /* Try to launch one. */
> > + child = StartChildProcess(B_IO_WORKER);
> > + if (child != NULL)
> > + {
> > + io_worker_children[id] = child;
> > + ++io_worker_count;
> > + }
> > + else
> > + break; /* XXX try again soon? */
>
> I'd change the comment to something like one of:
>
> retry after DetermineSleepTime()
> next LaunchMissingBackgroundProcesses() will retry in <60s

Hm, we retry more frequently than that if there are new connections... Maybe
just "try again next time"?

> On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > - Decide what to do about the smgr interrupt issue
>
> Replied on that thread. It's essentially ready.

Cool, will reply there in a bit.

> > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
>
> Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph:
>
> * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> buffer to complete (and in releases before 14, it was accompanied by a
> per-buffer LWLock). The process doing a read or write sets the flag for the
> duration, and processes that need to wait for it to be cleared sleep on a
> condition variable.

First draft:
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock). The process starting a read or write sets the flag. When
the I/O is completed, be it by the process that initiated the I/O or by another
process, the flag is removed and the buffer's condition variable is signalled.
Processes that need to wait for the I/O to complete can wait for asynchronous
I/O using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by
sleeping on the buffer's condition variable.

> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>
> * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>
> The last especially.

Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
12f3867f5534. I think the comment can just be deleted?

> * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
> * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> * only one process at a time can set the BM_IO_IN_PROGRESS bit.

> For the other three lines and the paragraph, the notion
> of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> being the process "doing a read" becomes less significant when one process
> starts the IO and another completes it.

Hm. I think they'd be ok as-is, but we can probably improve them. Maybe

* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
* we got the exclusive right to perform I/O by setting the
* BM_IO_IN_PROGRESS bit.

> > + /* we better have ensured the buffer is present until now */
> > + Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> I'd delete that comment; to me, the assertion alone is clearer.

Ok.

> > + ereport(LOG,
> > + (errcode(ERRCODE_DATA_CORRUPTED),
> > + errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's
> goals; is it needed? If so, I recommend splitting it out as a preliminary
> patch, to highlight the behavior change for release notes.

No, it's not needed. I think I looked over the patch at some point and
considered the log-level wrong according to our guidelines and thought I'd
broken it.

> > + /*
> > + * If the entire failed on a lower-level, each buffer needs to be
>
> Missing word, probably fix like:
> s,entire failed on a lower-level,entire I/O failed on a lower level,

Yep.

> > + * marked as failed. In case of a partial read, some buffers may be
> > + * ok.
> > + */
> > + failed =
> > + prior_result.status == ARS_ERROR
> > + || prior_result.result <= buf_off;
>
> I didn't run an experiment to check the following, but I think this should be
> s/<=/</. Suppose we requested two blocks and read some amount of bytes
> [1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0
> should compute failed=false here, but buf_off==1 should compute failed=true.

Huh, you might be right. I thought I wrote a test for this, I wonder why it
didn't catch the problem...

> I see this relies on md_readv_complete having converted "result" to blocks.
> Was there some win from doing that as opposed to doing the division here?
> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> to follow, to me.

It seemed like that would be wrong layering - what if we had an smgr that
could store data in a compressed format? The raw read would be of a smaller
size. The smgr API deals in BlockNumbers, only the md.c layer should know
about bytes.

> > +
> > + buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> > + is_temp);
> > +
> > + /*
> > + * If there wasn't any prior error and the IO for this page failed in
> > + * some form, set the whole IO's to the page's result.
>
> s/the IO for this page/page verification/
> s/IO's/IO's result/

Agreed.

Thanks for the review!

Greetings,

Andres Freund
