From: Andres Freund <andres(at)anarazel(dot)de>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2025-01-08 20:58:50
Message-ID: exrjge7fo7hcqvmcfscbxti6vyzuyy7gs2wpjgmxpnvuvgrnud@mxhnya3f5oyp
Lists: pgsql-hackers
Hi,
On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
> On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I didn't think that pg_stat_* was quite the right namespace, given that it
> > shows not stats, but the currently ongoing IOs. I am going with pg_aios for
> > now, but I don't particularly like that.
>
> If you are looking for other proposals:
> * pg_aios_progress ? (to follow the pattern of pg_stat_copy|vacuum_progress?)
> * pg_debug_aios ?
> * pg_debug_io ?
I think pg_aios is better than those, if not by much. Seems others are ok
with that name too. And we can easily evolve it later.
> > I think we'll want a pg_stat_aio as well, tracking things like:
> >
> > - how often the queue to IO workers was full
> > - how many times we submitted IO to the kernel (<= #ios with io_uring)
> > - how many times we asked the kernel for events (<= #ios with io_uring)
> > - how many times we had to wait for in-flight IOs before issuing more IOs
>
> If I could dream of one thing, it would be the 99.9th percentile of IO
> response times in milliseconds for different classes of I/O traffic
> (read/write/flush). But it sounds like it would be very similar to
> pg_stat_io and would potentially have to be
> per-tablespace/IO-traffic(subject)-type too.
Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.
> AFAIU pg_stat_io has improper structure to have that there.
Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?
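For instance, against pg_stat_io as it exists today (track_io_timing needs
to be on for the *_time columns to be populated; the percentile column
itself is hypothetical and would just sit next to the existing timing
columns):

  SELECT backend_type, object, context,
         reads, read_time,   -- a p99.9 latency column could go next to these
         writes, write_time
  FROM pg_stat_io
  WHERE reads > 0;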
> BTW: before I even start to compile that AIO v2.2* and respond to the
> previous review, what are you most interested in hearing about, so that
> it adds some value?
Due to the rather limited set of "users" of AIO in the patchset, I don't
expect most benchmarks to show any meaningful gains. However, they
shouldn't show any significant regressions either (when not using direct
IO). I think trying to find regressions would be a rather valuable thing.
I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not sure
it's a good investment of time right now.
One small regression I do know about: scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increased
BAS_BULKREAD ring size does cause a small slowdown - but without it we can
never do sufficient asynchronous IO. I think the slowdown is small enough to
just accept, but it's worth quantifying on a few machines.
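A minimal sketch of that quantification (sizes are placeholders; the point
is a relation around 2x shared_buffers that still fits in RAM):

  -- e.g. with shared_buffers=8GB on a 64GB machine:
  CREATE TABLE bulkread_test AS
    SELECT g AS id, repeat('x', 100) AS pad
    FROM generate_series(1, 100000000) g;  -- sized to ~2x shared_buffers
  \timing on
  SELECT count(*) FROM bulkread_test;      -- repeat a few times, master vs. patch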
> Any workload specific measurements? just general feedback, functionality
> gaps?
To see the benefits it'd be interesting to compare:
1) sequential scan performance with data not in shared buffers, using buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance
In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read ahead.
I do see very significant gains for 2) - on a system with 10 striped NVMe SSDs
that each can do ~3.5 GB/s I measured highly parallel sequential scans (I had
to use ALTER TABLE to get a sufficient number of workers):
master:                 ~18 GB/s
patch, buffered:        ~20 GB/s
patch, direct, worker:  ~28 GB/s
patch, direct, uring:   ~35 GB/s
This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).
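For concreteness, a sketch of that kind of setup (table name and parallel
worker counts are placeholders; io_method, io_workers and io_max_concurrency
are GUCs from the patchset, debug_io_direct is the existing direct-IO switch;
most of these require a restart to take effect):

  ALTER SYSTEM SET io_method = 'io_uring';           -- or 'worker'
  ALTER SYSTEM SET io_workers = 32;                  -- used with io_method='worker'
  ALTER SYSTEM SET io_max_concurrency = 128;
  ALTER SYSTEM SET effective_io_concurrency = 1000;
  ALTER SYSTEM SET debug_io_direct = 'data';         -- only for the direct-IO runs
  ALTER SYSTEM SET max_parallel_workers = 64;
  ALTER SYSTEM SET max_parallel_workers_per_gather = 64;
  -- after a restart:
  ALTER TABLE scan_target SET (parallel_workers = 64);
  SELECT count(*) FROM scan_target;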
This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
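For anyone reproducing: whether a cluster has checksums enabled is easy to
check (pg_checksums can enable/disable them on a stopped cluster):

  SHOW data_checksums;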
I also see significant gains with 3), bigger when using direct IO. One
complicating factor when measuring 3) is that the first write to a block will
often be slower than subsequent writes, because the filesystem needs to update
some journaled metadata, presenting a bottleneck.
Checkpoint performance is also severely limited by data checksum computation
if enabled - independent of this patchset.
One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.
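Until that's done, a workaround sketch for benchmarking (pg_prewarm is in
contrib; the table name is a placeholder, and this only helps for relations
that fit in s_b):

  CREATE EXTENSION IF NOT EXISTS pg_prewarm;
  SELECT pg_prewarm('scan_target');  -- load the relation into shared buffers
  VACUUM scan_target;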
> Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
> to try the error handling routines?
Hm. I don't think that's going to work very well even on master. If the
filesystem fails, there's not much that PG can do...
> Some kind of AIO <-> standby/recovery interactions?
I wouldn't expect anything there. I think Thomas somewhere has a patch that
read-stream-ifies recovery prefetching; once that's done it would be more
interesting.
> * - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! So
> let's officially recognize 2025 as the year of AIO in PG, as it was
> the 1st message :D
Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...
Greetings,
Andres Freund