From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance degradation on concurrent COPY into a single relation in PG16.
Date: 2023-07-11 07:09:43
Message-ID: CAKZiRmyQ76T83FCsQxNDxq_mf8fcwE4O=yZk8re0GVfJDS1mhg@mail.gmail.com
Lists: pgsql-hackers
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well. It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.
I haven't seen the thread [1] "Question on slow fallocate" from the XFS
mailing list mentioned here (it was started by Masahiko), but I do feel
it contains very important hints, even ones challenging the whole idea
of zeroing out files (or posix_fallocate()). Please see Dave's reply in
particular. He also argues that posix_fallocate() != fallocate().
What's interesting is that this is by design, and newer kernel versions
do not prevent such behaviour either; see my testing results below.
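Just to make that distinction concrete for anyone reading along -- this
is my own minimal illustration, not code from that thread: fallocate(2)
is a single syscall that either reserves the range or fails, while
glibc's posix_fallocate() silently falls back to writing into every
block when the syscall is not supported, so the two can end up with
completely different IO patterns:

/*
 * Hypothetical illustration (not Masahiko's test.c): the two ways of
 * reserving 1 GB of file space that the XFS thread distinguishes.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const off_t size = 1024L * 1024 * 1024;
    int fd = open(argc > 1 ? argv[1] : "alloc.test",
                  O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Direct syscall: either the filesystem allocates the range, or we
     * get EOPNOTSUPP and have to decide ourselves what to do next. */
    if (fallocate(fd, 0, 0, size) != 0)
        fprintf(stderr, "fallocate: %s\n", strerror(errno));

    /* Library call: glibc falls back to dirtying every block when the
     * filesystem (or kernel) lacks fallocate support, so its IO pattern
     * can be completely different from the syscall above. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return 0;
}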
All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent one available (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on
that kernel. After installing kernel-ml (6.4.2) the test.c result seems
to be the same (note that it occurs only when first allocating the
space; of course it doesn't if the same file is rewritten/"reallocated"):
[root(at)rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root(at)rockyora ~]# time ./test test.0 0
total 200000
fallocate 0
filewrite 200000
real 0m0.405s
user 0m0.006s
sys 0m0.391s
[root(at)rockyora ~]# time ./test test.0 1
total 200000
fallocate 200000
filewrite 0
real 0m0.137s
user 0m0.005s
sys 0m0.132s
[root(at)rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m0.968s
user 0m0.020s
sys 0m0.928s
[root(at)rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m6.059s
user 0m0.000s
sys 0m0.788s
[root(at)rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m0.598s
user 0m0.003s
sys 0m0.225s
[root(at)rockyora ~]#
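In case it helps anyone reproducing this, the "fallocate"/"filewrite"
counters above boil down to roughly the following loop -- a sketch
only, the real test.c is Masahiko's, and the 8kB block size and mode
numbering here are my assumptions based on the output:

/* Rough sketch of what the counters above are counting; block size,
 * block count and mode numbering are assumed, not taken from test.c. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  200000

int main(int argc, char **argv)
{
    static char block[BLCKSZ];          /* zero-filled 8kB block */
    long n_fallocate = 0, n_filewrite = 0;

    if (argc < 2) { fprintf(stderr, "usage: %s file [mode]\n", argv[0]); return 1; }
    int mode = argc > 2 ? atoi(argv[2]) : 0;
    int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (long i = 0; i < NBLOCKS; i++)
    {
        off_t off = (off_t) i * BLCKSZ;

        /* mode 0: write only, mode 1: fallocate only, mode 2: alternate */
        if (mode == 1 || (mode == 2 && i % 2 == 0))
        {
            if (fallocate(fd, 0, off, BLCKSZ) != 0) { perror("fallocate"); return 1; }
            n_fallocate++;
        }
        else
        {
            if (pwrite(fd, block, BLCKSZ, off) != BLCKSZ) { perror("pwrite"); return 1; }
            n_filewrite++;
        }
    }

    printf("total %ld\nfallocate %ld\nfilewrite %ld\n",
           n_fallocate + n_filewrite, n_fallocate, n_filewrite);
    close(fd);
    return 0;
}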
iostat -x reports the following during the first "time ./test test.2 2"
(as you can see, w_await is not that high, but it accumulates):
Device  r/s   w/s       rMB/s  wMB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     0.00  15394.00  0.00   122.02  0.00    13.00   0.00   0.08   0.00     0.05     0.75    0.00      8.12      0.06   100.00
dm-0    0.00  15407.00  0.00   122.02  0.00    0.00    0.00   0.00   0.00     0.06     0.98    0.00      8.11      0.06   100.00
So maybe that's just a hint that you should try it on slower storage
instead? (I think that on NVMe this issue would hardly be noticeable
due to the low IO latency, unlike here.)
-J.