Re: AIO v2.3

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: AIO v2.3
Date: 2025-02-06 10:50:04
Message-ID: CAKZiRmzP5K-SxMXU9Xr_Lsr8_-qKG4N2L6t1r1QegUMdpyhh1A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 23, 2025 at 5:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> Attached is v2.3.
>
> There are a lot of changes - primarily renaming things based on on-list and
> off-list feedback. But also some other things

[..snip]

Hi Andres, OK, so I hastily launched a probe run of the AIO v2.3 patchset
(full, 29 patches) before going on a short vacation, and the results are
attached*. TL;DR: in terms of SELECTs, master vs. aioworkers looks very
solid! I was somewhat afraid that the additional IPC to separate processes
would put workers at a slight disadvantage, but amazingly that's not the
case. The intention of this effort was just to see whether committing AIO
with its defaults as it stands is good enough not to cause basic
regressions for users, and to me it looks like it's nearly finished :)).
To save time I have *not* tested aio23 with io_uring; this is just about
aioworkers (the future default).

Random notes and thoughts:

1. Not a single crash was observed, but those were pretty short runs.

2. My (admittedly time-limited) data analysis thoughts:
- Most of the time, performance with aioworkers is identical (+/- 3%) to
master's; in many cases it is much BETTER.
- Boosts of up to ~2.01x can be spotted even on low-end hardware like this
with fast I/O, even without io_uring (just workers).
- On "sata" seqscans with datasets bigger than the VFS cache ("big") and
without parallel workers, it looks like it's always better.
- On parallel "sata" seqscans with datasets bigger than the VFS cache
("big"), high effective_io_concurrency (e_io_c) and high client counts
(sigh!), it looks like there would be a user-noticeable regression, but to
me it's not a regression per se; probably we are issuing way too many
posix_fadvise() readaheads with diminishing returns. Just letting you know.
Not sure it is worth introducing some global limiter (e_io_c shared across
aioworkers); I think not -- the existing knob might be enough, see the
sketch right after this list. It could also have been some maintenance
noise on that I/O device, but I don't have an isolated SATA RAID10 with,
say, 8x HDDs at home to launch such a test to be absolutely sure.
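
To be explicit about the existing knob I mean: a rough sketch of lowering
e_io_c just for the slow device instead of adding any global limiter (the
tablespace name "sata" is only illustrative, and whether the patchset
scales its readaheads off this setting the same way master does is my
assumption):

  ALTER TABLESPACE sata SET (effective_io_concurrency = 4);
  -- or, just for one session, to check whether the regression disappears:
  SET effective_io_concurrency = 4;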

3. With aioworkers, it would be worth pointing out in the documentation
that `iotop` won't be good enough to show which PID is doing the I/O
anymore. I often get questions like: who is consuming the most I/O right
now, because storage is fully saturated on this multi-use system? Not sure
whether that requires a new view or not (pg_aios looks more like an
in-memory debug view that would have to be sampled aggressively, and
pg_statio_all_tables shows the table but not the PID -- same for
pg_stat_io). IMHO if the docs simply said something like "In order to
understand which processes (PIDs) are issuing lots of IOs, please check
pg_stat_activity for *IO/AioCompletion* wait events", it would be good
enough for a start.
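
Something along these lines (a rough sketch only; the exact wait event
name 'Aio*' is my assumption based on the names above and may differ in
what gets committed):

  SELECT pid, backend_type, datname, wait_event, query
  FROM pg_stat_activity
  WHERE wait_event_type = 'IO'
    AND wait_event LIKE 'Aio%';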

Bench machine: intentionally much smaller hardware, Azure's Lsv2 L8s_v2
(1st-gen EPYC/1s4c8t, kernel 6.10.11+bpo-cloud-amd64, booted with mem=12GB,
which limited real usable RAM to roughly ~8GB to stress I/O). liburing 2.9.
Standard compile options were used, without asserts (such as normal users
would use). The bench machine had these two I/O storage devices (with XFS)
attached:
- "sata" stands for Azure's "Premium SSD LRS" mounted on /sata
(Size=255GB, Max IOPS=1100 (@ 4kB?), Max throughput=125MB/s)
- "nvme" stands for bulit-in NVME on that VM mounted on /nvme
(Size=1788GB, Max IOPS=8000 (@ 4kB?))

I'll try to see in the coming weeks whether dedicating more time is
possible (longer runs, more write tests, maybe some basic I/O failure
injection tests).

-J.

* = 8640 test runs, always with a restart and a VFS cache flush; it took
probably 2-3 days? I had to reduce tries to 1 and limit myself to just
reads to get it running solidly before I left, so as not to miss the
plane :^)

Attachment Content-Type Size
aio23_potential_parallel_seqscan_regression.png image/png 98.8 KB
results.out.bz2 application/x-compressed 97.9 KB
pg_aio23_basic_tests_v001.xlsx application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 1.0 MB
pg_aio23_basic_tests_v001.ods application/vnd.oasis.opendocument.spreadsheet 770.4 KB
