Re: AIO v2.2

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: AIO v2.2
Date: 2025-01-07 19:10:59
Message-ID: 20250107191059.24.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 06, 2025 at 04:40:26PM -0500, Andres Freund wrote:
> On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
> > On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:

> - We have pretty no testing for IO errors.

Yes, that's remained a gap. I've wondered how much to address this via
targeted tests of specific sites vs. fuzzing, iterative fault injection, or
some other approach closer to brute force.

> > I'd be most interested in the
> > cases that would be undetected deadlocks under a naive design. An example
> > appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
>
> That's a good one, yea.
>
> I think I'll try to translate the regression tests I wrote into an isolation
> test, I hope that'll make it a bit easier to cover more cases.
>
> And then we'll need more injection points, I'm afraid :(.

Sounds good.

> > * - method_*.c - different ways of executing AIO (e.g. worker process)
> > * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
> > * - aio_subject.c - callbacks at IO operation lifecycle events
> > * - aio_init.c - per-fork and per-startup-process initialization
>
> I don't particularly like "per-startup-process", because "global
> initialization" really is separate (and precedes) from startup processes
> startup. Maybe "per-server and per-backend initialization"?

That works for me. I wrote "per-startup-process" because it can happen more
than once in a postmaster that reaches "all server processes terminated;
reinitializing". That said, there's little risk of "per-server" giving folks
a materially wrong idea.

> > * - aio.c - all other topics
> > * - read_stream.c - helper for reading buffered relation data
>
> Did the order you listed the files have a system to it? If so, what is it?

The rough idea was to avoid forward references:

* - method_*.c - different ways of executing AIO (e.g. worker process)
makes sense without other background
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
refers to methods, so listed after methods
* - aio_subject.c - callbacks at IO operation lifecycle events
refers to IO ops, so listed after aio_io.c
* - aio_init.c - per-fork and per-startup-process initialization
no surprise that this code will exist somewhere, so list it lower to deemphasize it
* - aio.c - all other topics
default route, hence last
* - read_stream.c - helper for reading buffered relation data
could just as easily come first, not last
could be under a distinct heading like "Recommended abstractions:"

> > I'd benefit from seeing things in this order:
> >
> > - "why"
> > - condensed usage example like manpage SYNOPSIS, comments and decls removed
> > - PgAioHandleState and discussion of valid transitions
>
> Hm - why have PgAioHandleState and its states before the usage example? Seems
> like it'd be harder to understand that way.

I usually look at the data structures before the code that manipulates them.
(Similarly, I look at the map before the directions.) I wouldn't mind it
appearing after the usage example, since order preferences do vary.

> > - usage example as it is, with full comments
> > - the rest
>
>
> > ## Synopsis
> >
> > ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
> > pgaio_io_get_ref(ioh, &ior);
> > pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
> > pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
> > smgrstartreadv(ioh, operation->smgr, forknum, blkno,
> > BufferGetBlock(buffer), 1);
> > pgaio_submit_staged();
> > pgaio_io_ref_wait(&ior);
> > if (ioret.result.status == ARS_ERROR)
> > pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);
>
> Happy to add this, but I'm not entirely sure if that's really that useful to
> have without commentary? The synopsis in manpages is helpful because it
> provides the signature of various functions, but this wouldn't...

I'm not sure either. Let's drop that idea.

> > > +### IO can be started in critical sections
> > ...
> > > +The need to be able to execute IO in critical sections has substantial design
> > > +implication on the AIO subsystem. Mainly because completing IOs (see prior
> > > +section) needs to be possible within a critical section, even if the
> > > +to-be-completed IO itself was not issued in a critical section. Consider
> > > +e.g. the case of a backend first starting a number of writes from shared
> > > +buffers and then starting to flush the WAL. Because only a limited amount of
> > > +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> > > +may require to first finish executing IO executed earlier.
> >
> > The last line's two appearances of the word "execute" read awkwardly to me,
> > and it's an opportunity to use PgAioHandleState terms. Consider writing the
> > last line like "may first advance an existing IO from AHS_PREPARED to
> > AHS_COMPLETED_SHARED".
>
> It is indeed awkward. I don't love referencing the state-constants here
> though, somehow that feels like a reference-cycle ;). What about this:
>
> > ... Consider
> > e.g. the case of a backend first starting a number of writes from shared
> > buffers and then starting to flush the WAL. Because only a limited amount of
> > IO can be in-progress at the same time, initiating IO for flushing the WAL may
> > require to first complete IO that was started earlier.

That's non-awkward. I like specific state names here since "complete" could
mean AHS_COMPLETED_SHARED or AHS_COMPLETED_LOCAL, and it matters here. If the
state names changed so AHS_COMPLETED_LOCAL dropped the word "complete", that
too would solve it.

> > > +### AIO Callbacks
> > ...
> > > +In addition to completion, AIO callbacks also are called to "prepare" an
> > > +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> > > +IO to/from shared buffers, which is required to handle the case where the
> > > +issuing backend errors out and releases its own pins.
> >
> > Reading this, it's not obvious to me how to reconcile "finishing an IO could
> > require pin acquisition" with "finishing an IO could happen in a critical
> > section". Pinning a buffer in a critical section sounds bad. I vaguely
> > recall understanding how it was okay as of my September review, but I've
> > already forgotten. Can this text have a sentence making that explicit?
>
> Ah, yes, that's easy to misunderstand. The answer basically is that we don't
> newly pin a buffer, we just increment the reference count by 1. That should
> never fail.
>
> How about:
> > In addition to completion, AIO callbacks also are called to "prepare" an
> > IO. This is, e.g., used to increase buffer reference counts to account for the
> > AIO subsystem referencing the buffer, which is required to handle the case
> > where the issuing backend errors out and releases its own pins while the IO is
> > still ongoing.

Perfect.

> > > +### AIO Subjects
> > > +
> > > +In addition to the completion callbacks describe above, each AIO Handle has
> > > +exactly one "subject". Each subject has some space inside an AIO Handle with
> > > +information specific to the subject and can provide callbacks to allow to
> > > +reopen the underlying file (required for worker mode) and to describe the IO
> > > +operation (used for debug logging and error messages).
> >
> > Can this say roughly how to decide when to add a new subject?
>
> Hm, there obviously is some fuzziness. I was trying to get to some of that by
> mentioning that the subject needs to know how to [re-]open a file and describe
> the target of the IO in terms that make sense to the user.
>
> E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
> open a file by delegating that to the layer below and the layer above just
> knows about smgr, not md.c (or other potential smgr implementations).
>
> The reason to keep this separate from the callbacks was that smgr IO going
> through shared buffers, bypassing shared buffers and different smgr
> implemenentations all could share the same subject implementation, even if
> callbacks would differ between these use cases.
>
>
> How about:
>
> > I.e., if two different uses of AIO can describe the identity of the file being
> > operated on the same way, it likely makes sense to use the same
> > subject. E.g. different smgr implementations can describe IO with
> > RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
> > contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
> > and it would not make sense to use the same subject for smgr and WAL.

Sounds good to include.

> > Can this have a sentence on how this fits in bounded shmem, given the lack of
> > guarantees about a backend's responsiveness? In other words, what makes it
> > okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
> > to AHS_COMPLETED_LOCAL?
>
> I agree this should be explained somewhere - but not sure this is the best
> place.
>
> The reason it's ok is that each backend has a limited number of AIO handles
> and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
> wait for the oldest IO to finish.

Reading it again today, that topic may already have adequate coverage.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-01-07 19:20:09 Re: Moving the vacuum GUCs' docs out of the Client Connection Defaults section
Previous Message Mahendra Singh Thalor 2025-01-07 19:04:47 Re: Non-text mode for pg_dumpall