Re: AIO v2.5

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: AIO v2.5
Date: 2025-03-07 10:21:09
Message-ID: CAKZiRmxiKzvrApJ8DZ9O15=cUGY=Bm_TyZg1wxxVW9B2rt7jdw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 6, 2025 at 2:13 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2025-03-06 12:36:43 +0100, Jakub Wartak wrote:
> > On Tue, Mar 4, 2025 at 8:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Questions:
> > >
> > > - My current thinking is that we'd set io_method = worker initially - so we
> > > actually get some coverage - and then decide whether to switch to
> > > io_method=sync by default for 18 sometime around beta1/2. Does that sound
> > > reasonable?
> >
IMHO, yes, good idea. Anyway, the final outcome will partially depend
on how many other stream consumers get committed, right?
>
> I think it's more whether we find cases where it performs substantially worse
> with the read stream users that exist. The behaviour for non-read-stream IO
> shouldn't change.

OK, so in order to get the full picture for v18beta, this would mean
$thread plus the following ones?:
- Use read streams in autoprewarm
- BitmapHeapScan table AM violation removal (and use streaming read API)
- Index Prefetching (it seems to have stalled?)

or is there something more planned? (I'm asking what to apply on top
of AIO to minimize the number of test runs, which seem to take a lot
of time, so I can do it all in one go.)
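
(Side note: for the test matrix itself, flipping between methods per
run is cheap, roughly like below; just a sketch, assuming the GUC
names from the branch, i.e. io_method and io_workers:)

    # io_method looks postmaster-context to me, so a restart is needed;
    # values in the branch: sync, worker (and io_uring where built in)
    psql -c "ALTER SYSTEM SET io_method = 'worker';"
    psql -c "ALTER SYSTEM SET io_workers = 3;"    # size of the worker pool
    pg_ctl -D "$PGDATA" restart
    pgbench -c 16 -j 16 -T 600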

> > So, I've taken the aio-2 branch from your GitHub repo for a small ride
> > on legacy RHEL 8.7 with dm-flakey to inject I/O errors. This is more a
> > question: perhaps IO workers should auto-close fds on errors, or should
> > we use SIGUSR2 for it? The scenario is like this:
>
> When you say "auto-close", you mean that one IO error should trigger *all*
> workers to close their FDs?

Yeah, I was somehow thinking about such a thing, but after you bolded
that "*all*", my question sounds much more stupid than it did
yesterday. Sorry for asking a stupid question :)
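
(For completeness, the injection itself was nothing fancy, roughly the
following; a sketch, with the device name and up/down intervals made up:)

    # wrap the tablespace device in dm-flakey: pass I/O for 20s, fail it for 5s
    SIZE=$(blockdev --getsz /dev/sdb1)
    dmsetup create flakey-ts --table "0 $SIZE flakey /dev/sdb1 0 20 5"
    mount /dev/mapper/flakey-ts /mnt/ts1
    # ... run the workload; errors surface during the 5s "down" windows ...
    umount /mnt/ts1 && dmsetup remove flakey-ts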

> The same is already true with bgwriter, checkpointer etc?

Yeah... I was kind of looking for a way of getting "higher
availability" in the presence of partial IO (tablespace) errors.

> > pg_terminate_backend() on those won't work. The only thing that works seems
> > to be sending SIGUSR2
>
> Sending SIGINT works.

Ugh, ok, it looks like I've been overthinking that, cool.
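
(Good to know. For the record, something like this then, assuming the
workers are identifiable by their process title:)

    # find the io workers and ask them to exit; SIGINT also closes their fds
    for pid in $(pgrep -f 'io worker'); do
        kill -INT "$pid"
    done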

> > , but is that safe [there could be some errors after pwrite() ]?
>
> Could you expand on that?

It is pure speculation on my side: I'm always concerned about leaving
something behind without cleanup after an error and then re-using it
for something else much later, especially on edge cases like NFS or
FUSE. In the backend we could maintain some state, but io_workers are
shared across backends. E.g. some pwrite() fails on NFS, we do not
close that fd, and then much later reuse it for a different backend
(although AFAIK close() does not guarantee anything either, but it
could be that some inode/path was simply marked dangling, so that a
fresh close()/open() pair would return an error, while here we would
just keep pwrite()ing there?).

OK, the only question that remains: does it make sense to try
something like pgbench on NFS over UDP with mountopt=hard,nointr plus
intermittent iptables DROPs from time to time, or is it not worth
trying?
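
(Concretely, I was thinking of something along these lines; a sketch,
with the server address, export path and timings made up:)

    # hard,nointr over UDP, so a silent server hangs I/O rather than erroring out
    mount -t nfs -o proto=udp,hard,nointr 192.168.1.10:/export/pgts /mnt/nfs_ts
    pgbench -c 8 -j 8 -T 1800 &
    # black-hole the NFS traffic for 30s out of every ~150s
    while true; do
        iptables -A OUTPUT -d 192.168.1.10 -j DROP
        sleep 30
        iptables -D OUTPUT -d 192.168.1.10 -j DROP
        sleep 120
    done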

> > With
> > io_method=sync, just quitting the backend of course works. Not sure
> > what your thoughts are, because any other bgworker could have open
> > fds there. It's a very minor thing. Otherwise an outage of a separate
> > (rarely used) tablespace would potentially cause inability to fsck
> > there and lower the availability of the DB (due to the potential
> > restart required).
>
> I think a crash-restart is the only valid thing to get out of a scenario like
> that, independent of AIO:
>
> - If there had been any writes we need to perform crash recovery anyway, to
> recreate those writes
> - If there just were reads, it's good to restart as well, as otherwise there
> might be pages in the buffer pool that don't exist on disk anymore, due to
> the errors.

OK, cool, thanks!

-J.
