Re: AIO v2.0

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2025-01-08 20:58:50
Message-ID: exrjge7fo7hcqvmcfscbxti6vyzuyy7gs2wpjgmxpnvuvgrnud@mxhnya3f5oyp
Lists: pgsql-hackers

Hi,

On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
> On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I didn't think that pg_stat_* was quite the right namespace, given that it
> > shows not stats, but the currently ongoing IOs. I am going with pg_aios for
> > now, but I don't particularly like that.
>
> If you are looking for other proposals:
> * pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
> * pg_debug_aios ?
> * pg_debug_io ?

I think pg_aios is better than those, if not by much. Seems others are ok
with that name too. And we easily can evolve it later.

> > I think we'll want a pg_stat_aio as well, tracking things like:
> >
> > - how often the queue to IO workers was full
> > - how many times we submitted IO to the kernel (<= #ios with io_uring)
> > - how many times we asked the kernel for events (<= #ios with io_uring)
> > - how many times we had to wait for in-flight IOs before issuing more IOs
>
> If I could dream of one thing, it would be the 99.9th percentile of IO
> response times in milliseconds for different classes of I/O traffic
> (read/write/flush). But it sounds like it would be very similar to
> pg_stat_io and potentially would have to be
> per-tablespace/IO-traffic(subject)-type too.

Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.
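
To illustrate the cost/accuracy trade-off, here is a minimal sketch of one
common approach: a fixed-size histogram with power-of-two bucket boundaries.
It is not PostgreSQL infrastructure, and the names (LatencyHistogram, record,
percentile) are hypothetical. Recording costs one increment per IO and memory
is constant, but the answer is only accurate to within a factor of two, which
shows why "reasonably accurate" percentiles need finer buckets or a more
elaborate sketch structure.

```python
import math
import random


class LatencyHistogram:
    """Fixed-size histogram with power-of-two bucket boundaries (microseconds)."""

    def __init__(self, nbuckets=32):
        self.counts = [0] * nbuckets
        self.total = 0

    def record(self, latency_us):
        # Bucket i holds latencies in [2**i, 2**(i+1)) us; bucket 0 also
        # absorbs anything below 1 us, the last bucket anything too large.
        idx = max(int(latency_us), 1).bit_length() - 1
        idx = min(idx, len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p):
        """Return the upper bound (us) of the bucket holding the p-th percentile."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        target = math.ceil(self.total * p / 100.0)
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 2 ** (idx + 1)
        return 2 ** len(self.counts)


if __name__ == "__main__":
    random.seed(0)
    hist = LatencyHistogram()
    for _ in range(100_000):
        # Synthetic latencies, purely for illustration.
        hist.record(random.lognormvariate(5.0, 1.0))
    print("p99.9 upper bound (us):", hist.percentile(99.9))
```

One such histogram per IO class (read/write/flush) would be needed, and
merging per-backend histograms is a simple element-wise sum of the counts.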

> AFAIU pg_stat_io has improper structure to have that there.

Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?

> BTW: before trying to even start to compile that AIO v2.2* and
> responding to the previous review, what are you most interested in
> hearing about, so that it adds some value?

Due to the rather limited "users" of AIO in the patchset, I think most
benchmarks aren't expected to show any meaningful gains. However, they
shouldn't show any significant regressions either (when not using direct
IO). I think trying to find regressions would be a rather valuable thing.

I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not sure
it's a good investment of time right now.

I do know about one small regression, namely scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increase
of BAS_BULKREAD does cause a small slowdown - but without it we never can do
sufficient asynchronous IO. I think the slowdown is small enough to just
accept that, but it's worth qualifying that on a few machines.

> Any workload specific measurements? just general feedback, functionality
> gaps?

To see the benefits it'd be interesting to compare:

1) sequential scan performance with data not in shared buffers, using buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance

In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read ahead.

I do see very significant gains for 2) - On a system with 10 striped NVMe SSDs
that each can do ~3.5 GB/s I measured very parallel sequential scans (I had
to use ALTER TABLE to get sufficient numbers of workers):

master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).

This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
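
To put the four numbers in context, the arithmetic can be sketched as below.
The figures are the approximate ones quoted above; the "ceiling" is simply
10 devices x 3.5 GB/s and assumes striping scales linearly, which is an
assumption of this sketch rather than something measured.

```python
# Approximate figures copied from the measurements above (GB/s).
DEVICES = 10
PER_DEVICE_GBPS = 3.5

results = {
    "master": 18,
    "patch, buffered": 20,
    "patch, direct, worker": 28,
    "patch, direct, uring": 35,
}


def summarize(results, devices=DEVICES, per_device=PER_DEVICE_GBPS):
    """Return (ceiling, rows) with speedup vs. master and share of the ceiling."""
    ceiling = devices * per_device  # nominal aggregate bandwidth
    baseline = results["master"]
    rows = []
    for name, gbps in results.items():
        rows.append((name, gbps, gbps / baseline, gbps / ceiling))
    return ceiling, rows


ceiling, rows = summarize(results)
print(f"nominal aggregate ceiling: {ceiling:.1f} GB/s")
for name, gbps, speedup, utilization in rows:
    print(f"{name:<24} {gbps:>3} GB/s  {speedup:4.2f}x master  "
          f"{utilization:6.1%} of ceiling")
```

Under that assumption, master reaches roughly half of the nominal device
bandwidth, while direct IO with io_uring gets close to all of it.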

I also see significant gains with 3). Bigger when using direct IO. One
complicating factor measuring 3) is that the first write to a block will often
be slower than subsequent writes because the filesystem will need to update
some journaled metadata, presenting a bottleneck.
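
A rough way to see that effect outside of PostgreSQL is to time a first pass
of block-sized writes against a rewrite of the same blocks. This is only a
sketch: it uses buffered IO plus fsync rather than direct IO, the function
name and parameters are made up for illustration, and the result depends
entirely on the filesystem (on tmpfs there may be no difference at all), so
the file should live on the filesystem under test.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def time_block_writes(path, nblocks=256, block_size=BLOCK_SIZE):
    """Write nblocks blocks to path twice; return (first_pass_s, rewrite_s)."""
    buf = b"\0" * block_size
    timings = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(2):
            start = time.perf_counter()
            for blockno in range(nblocks):
                os.pwrite(fd, buf, blockno * block_size)
            # Include the cost of making data and metadata durable.
            os.fsync(fd)
            timings.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    # The first pass allocates the blocks, the second overwrites them in place.
    return timings[0], timings[1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        first, rewrite = time_block_writes(os.path.join(tmpdir, "blocks.dat"))
        print(f"first pass: {first * 1000:.2f} ms, "
              f"rewrite: {rewrite * 1000:.2f} ms")
```

When benchmarking checkpoints, it helps to compare like with like: either
always write to freshly extended relations or always to already-written ones.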

Checkpoint performance is also severely limited by data checksum computation
if enabled - independent of this patchset.

One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.

> Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
> to try the error handling routines?

Hm. I don't think that's going to work very well even on master. If the
filesystem fails there's not much that PG can do...

> Some kind of AIO <-> standby/recovery interactions?

I wouldn't expect anything there. I think Thomas somewhere has a patch that
read-stream-ifies recovery prefetching, once that's done it would be more
interesting.

> * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so
> let's officially recognize the 2025 as the year of AIO in PG, as it
> was 1st message :D

Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...

Greetings,

Andres Freund
