From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, 陈宗志 <baotiao(at)gmail(dot)com>
Subject: Re: AIO v2.0
Date: 2024-11-18 12:19:58
Message-ID: CAKZiRmzYaLmsuSPN-Mc3s2CM1Gs+LDkPMsKism_M9j50g8916A@mail.gmail.com
Lists: pgsql-hackers
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> Attached is the next version of the patchset. (..)
Hi Andres,
Thank you for your admirable persistence on this. Please do not take the
following as criticism; it is more a set of questions about patchset v2.1,
which I finally got a little time to play with:
0. Doesn't io_uring_get_sqe() in pgaio_uring_submit() (in
v2.1-0011-aio-Add-io_uring-method.patch) need a return value check?
Otherwise we'll never know that the SQ is full; perhaps at least such a
check should be made with an Assert()? (I understand that right now we
allow only up to io_uring_queue_init(io_max_concurrency), but what happens
if:
a. a previous io_uring_submit() failed for some reason and we do not have
free space in the SQ?
b. (hypothetically) someday someone makes PG multithreaded and the code
starts using just one big queue, still without checking
io_uring_get_sqe()?)
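For illustration, here is a small standalone liburing demo (mine, not
patch code) showing that io_uring_get_sqe() returns NULL once the SQ ring
is exhausted, which is exactly the condition I think pgaio_uring_submit()
should check or at least Assert() on:

#include <liburing.h>
#include <stdio.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    int i;

    if (io_uring_queue_init(4, &ring, 0) < 0)
        return 1;

    /* grab more SQEs than the 4-entry ring holds, without submitting */
    for (i = 0; i < 6; i++)
    {
        sqe = io_uring_get_sqe(&ring);
        printf("sqe %d: %s\n", i, sqe ? "ok" : "NULL (SQ full)");
    }

    io_uring_queue_exit(&ring);
    return 0;
}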
1. In [0] you wrote that io_uring consumes a high number of FDs
(dangerously close to RLIMIT_NOFILE). I can attest that many customers use
extremely high max_connections (4k-5k, with outliers at 10k in the wild
too) - for them it won't even start - and I have one doubt about the
user-friendliness impact of this. I'm quite certain it's going to be the
same as with pgbouncer, where one is forced to tweak the OS
(systemd/pam/limits.conf/etc.); still, PG is in a better position because
it tries to preallocate and then close() a lot of FDs, so that's safer at
runtime. IMVHO, even if we consume, say, > 30% of FDs just for io_uring,
max_files_per_process loses its spirit a little, and PG starts to lose
efficiency too, due to frequent open()/close() calls as the fd cache
becomes too small. Tomas also complained about this some time ago in [1].
So maybe it would be good to introduce a couple of sanity checks too (even
after setting a higher limit), as sketched after this list:
- issue a FATAL when io_method = io_uring and max_connections is close to
getrlimit(RLIMIT_NOFILE)
- issue a WARNING when io_method = io_uring and we wouldn't have even a
real 1k FDs free for handling relation FDs (detect something bad like:
getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
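A hedged sketch of what I mean, in standalone form (the hardcoded values
stand in for the real MaxConnections / max_files_per_process GUCs, and
fprintf() stands in for ereport(FATAL/WARNING)):

#include <stdio.h>
#include <sys/resource.h>

/* placeholders standing in for the real GUC values */
static int max_connections = 5000;
static int max_files_per_process = 1000;

int
main(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return 1;

    /* FATAL-ish: per-backend io_uring FDs alone would exhaust the limit */
    if ((rlim_t) max_connections >= rlim.rlim_cur)
        fprintf(stderr, "FATAL: max_connections (%d) too close to RLIMIT_NOFILE (%ld)\n",
                max_connections, (long) rlim.rlim_cur);
    /* WARNING-ish: not even ~1k real FDs left for relation files */
    else if (rlim.rlim_cur <= (rlim_t) (max_connections + max_files_per_process))
        fprintf(stderr, "WARNING: very few FDs left for relation files\n");

    return 0;
}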
2. In pgaio_uring_postmaster_child_init_local() there is
io_uring_queue_init(32, ...) - why 32? :) And there is also a separate
io_uring_queue_init(io_max_concurrency), which seems to be derived from
AioChooseMaxConcurrency(), which can go up to 64?
3. I find the two near-identically named GUCs (effective_io_concurrency,
io_max_concurrency) confusing. It is clear from the IO_URING perspective
what io_max_concurrency is all about, but I bet having
effective_io_concurrency in the mix as well is going to be a little
confusing for users (well, it is to me). Maybe that README.md could
elaborate a little on the relation between those two? Or do you plan to
remove io_max_concurrency and bind it to effective_io_concurrency in the
future? To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014
too, while the earlier mentioned AioChooseMaxConcurrency() goes up to
just 64.
4. While we are at it, shouldn't the patch rename debug_io_direct to
simply io_direct, so that the GUCs are consistent in terms of naming?
5. It appears that pg_stat_io.reads is not refreshed until the query
finishes. While running a query for minutes with this patchset, I got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work? Also, pg_stat_io.read_time would be
somewhat vague with io_uring/worker, so maybe zero is good here(?).
Otherwise we would have to measure the time spent waiting alone, but that
would cost extra instructions for calculating IO times (see the sketch
below)...
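To make the "measure the wait alone" idea concrete, here is a hedged
sketch using the existing instr_time machinery, guarded by track_io_timing
so the extra instructions stay opt-in. timed_io_wait() and
wait_for_completion() are hypothetical names of mine; where this would
actually hook into the AIO wait path is exactly the open question:

#include "postgres.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "storage/bufmgr.h"     /* for track_io_timing */

extern void wait_for_completion(void);  /* hypothetical AIO wait primitive */

static void
timed_io_wait(void)
{
    instr_time  io_start, io_time;

    if (track_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);

    wait_for_completion();

    if (track_io_timing)
    {
        INSTR_TIME_SET_CURRENT(io_time);
        INSTR_TIME_SUBTRACT(io_time, io_start);
        /* charge only the wait itself to read_time */
        pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
    }
}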
6. After playing with some basic measurements - which went fine - I wanted
to test simple PostGIS, even with just sequential scans, to look for any
compatibility issues (AFAIR Thomas Munro indicated at PGConf.EU that it
would be a good testing point), but before that I tried to see what TOAST
performance alone looks like with AIO+DIO (debug_io_direct=data). One
issue I have found is that DIO seems to be unusable until somebody teaches
TOAST to use read streams - is that correct? Maybe I'm doing something
wrong, but I haven't seen any TOAST <-> read streams topic:
-- 12MB table, 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master   101897.823 ms (01:41.898)
AIO       99758.399 ms (01:39.758)
AIO+DIO  191479.079 ms (03:11.479)
The hot path was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended() -> AIO code.
The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data via seqscan: up to 285 MB/s
* but TOAST maxes out at 132 MB/s when using io_uring+DIO
Not about the patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the actual I/O calls being requested, to review whether
the code is issuing sensible calls? There's no strace for io_uring - do
you stick to DEBUG3, or is using something like bpftrace / xfsslower the
best way to go?
8. Not sure if this helps, but I somehow managed to hit the impossible
situation you describe in pgaio_uring_submit() ("ret != num_staged_ios"),
though I had to push urings really hard into using futexes, and I could
have made some coding error for it to occur too [3]. As it stands in the
patch from my thread, that case was not covered: /* FIXME: fix ret !=
submitted ?! seems like bug?! */ (but I hit that code path pretty often
with a 6.10.x kernel). A hedged sketch of handling it follows.
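This is only my sketch of treating a short return as retryable rather than
impossible (submit_all() and its arguments are names I made up, not the
patch's; io_uring_submit() returns the number of SQEs consumed, or -errno):

#include "postgres.h"
#include <errno.h>
#include <liburing.h>
#include <string.h>

static void
submit_all(struct io_uring *ring, int num_staged_ios)
{
    int submitted = 0;

    while (submitted < num_staged_ios)
    {
        int ret = io_uring_submit(ring);

        if (ret == -EINTR)
            continue;           /* interrupted: just retry */
        if (ret < 0)
            elog(ERROR, "io_uring_submit failed: %s", strerror(-ret));
        if (ret == 0)
            break;              /* nothing consumed; bail out rather than spin */
        submitted += ret;
    }
}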
9. Please let me know what the current line of thinking about this
patchset is: is it intended to be committed in v18? As a debug feature or
as a non-debug feature? (That is, which of the IO methods should be
scrutinized the most, as it is going to be the new default - sync or
worker?)
10. At this point, does it even make sense to give an experimental try to
pwritev2() with RWF_ATOMIC? (That flag is already out in the open, and XFS
is apparently going to cover it in 6.12.x, but I could try with some -rcX.)
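If it does, something like this standalone sketch is what I'd start from
(all mine, nothing from the patchset; it needs Linux 6.11+ headers/kernel
and a filesystem/device that actually supports atomic writes, otherwise
expect EINVAL):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifdef RWF_ATOMIC
int
main(int argc, char **argv)
{
    int     fd;
    char   *buf;
    struct iovec iov;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    /* one aligned 8kB "page", the PG-ish unit of interest */
    if (posix_memalign((void **) &buf, 4096, 8192) != 0)
        return 1;
    memset(buf, 'x', 8192);
    iov.iov_base = buf;
    iov.iov_len = 8192;

    if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
        perror("pwritev2(RWF_ATOMIC)"); /* EINVAL: size/alignment/support */
    close(fd);
    return 0;
}
#else
int
main(void)
{
    fprintf(stderr, "RWF_ATOMIC not defined in these headers\n");
    return 1;
}
#endif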
-J.
P.S. I hope I did not ask stupid questions or miss anything.
[0] https://www.postgresql.org/message-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic%40bqmlx5mrnwuf
- "Right now the io_uring mode has each backend's io_uring instance
visible to each other process. (..)"
[1] https://www.postgresql.org/message-id/flat/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c%40enterprisedb.com
- sentence after: "FWIW there's another bottleneck people may not realize
(..)"