Re: Performance degradation on concurrent COPY into a single relation in PG16.

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance degradation on concurrent COPY into a single relation in PG16.
Date: 2023-07-11 07:09:43
Message-ID: CAKZiRmyQ76T83FCsQxNDxq_mf8fcwE4O=yZk8re0GVfJDS1mhg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well. It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.

I haven't seen this thread [1], "Question on slow fallocate" from the XFS
mailing list (started by Masahiko), mentioned here, but I do feel it
contains very important hints, even challenging the whole idea of zeroing
out files (or posix_fallocate()). Please see Dave's reply in particular.
He also argues that posix_fallocate() != fallocate(). What's interesting
is that this is by design, and newer kernel versions do not prevent such
behaviour either; see my testing results below.
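
(For anyone who hasn't hit this distinction before, here is a minimal
sketch of posix_fallocate() vs fallocate() as I understand it; this is not
code from that thread, and the file name and size are arbitrary. The point
is that the raw syscall either allocates the range or fails, while the
glibc wrapper may silently fall back to zero-filling the range itself and
reports errors via its return value rather than errno.)

/* Minimal sketch, not from the thread: file name and size are made up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    off_t len = 100 * 8192;        /* arbitrary size */
    int   fd  = open("alloc_test.bin", O_CREAT | O_RDWR | O_TRUNC, 0644);

    if (fd < 0) { perror("open"); return 1; }

    /* Raw syscall: either the filesystem allocates the range or it fails
     * (e.g. EOPNOTSUPP); there is no fallback to writing zeroes. */
    if (fallocate(fd, 0, 0, len) != 0)
        perror("fallocate");

    /* glibc wrapper: may emulate the allocation by zero-filling the range
     * itself, and returns an error number instead of setting errno. */
    int rc = posix_fallocate(fd, 0, len);
    if (rc != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return 0;
}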

All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested on
the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on that
kernel. After installing kernel-ml (6.4.2) the test.c result seems to be
the same (note it occurs only when first allocating space, but of course
it doesn't if the same file is rewritten/"reallocated"):

[root(at)rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root(at)rockyora ~]# time ./test test.0 0
total 200000
fallocate 0
filewrite 200000

real 0m0.405s
user 0m0.006s
sys 0m0.391s
[root(at)rockyora ~]# time ./test test.0 1
total 200000
fallocate 200000
filewrite 0

real 0m0.137s
user 0m0.005s
sys 0m0.132s
[root(at)rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0

real 0m0.968s
user 0m0.020s
sys 0m0.928s
[root(at)rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000

real 0m6.059s
user 0m0.000s
sys 0m0.788s
[root(at)rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000

real 0m0.598s
user 0m0.003s
sys 0m0.225s
[root(at)rockyora ~]#
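
(Masahiko's test.c isn't quoted above, so for anyone without the original
attachment, the following is only my rough guess at its shape,
reverse-engineered from the counters in the output: the block size, loop
count, and the exact mode-2 interleaving of fallocate() and writes are
assumptions, not the original code.)

/* Rough guess at test.c, NOT the original: mode 0 = extend by writing
 * zeroes, mode 1 = extend with fallocate(), mode 2 = alternate the two. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NBLOCKS  200000
#define BLCKSZ   8192            /* assumed block size */

int main(int argc, char **argv)
{
    static char zerobuf[BLCKSZ]; /* zero-filled block */
    int nfallocate = 0, nfilewrite = 0;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <mode 0|1|2>\n", argv[0]);
        return 1;
    }

    int mode = atoi(argv[2]);
    int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < NBLOCKS; i++)
    {
        off_t off = (off_t) i * BLCKSZ;

        if (mode == 1 || (mode == 2 && i % 2 == 0))
        {
            /* extend via fallocate() */
            if (fallocate(fd, 0, off, BLCKSZ) != 0) { perror("fallocate"); return 1; }
            nfallocate++;
        }
        else
        {
            /* extend by writing a zeroed block */
            if (pwrite(fd, zerobuf, BLCKSZ, off) != BLCKSZ) { perror("pwrite"); return 1; }
            nfilewrite++;
        }
    }

    printf("total %d\nfallocate %d\nfilewrite %d\n", NBLOCKS, nfallocate, nfilewrite);
    close(fd);
    return 0;
}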

iostat -x reports during the first "time ./test test.2 2" (as you can see
w_await is not that high but it accumulates):
Device    r/s       w/s  rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
sda      0.00  15394.00   0.00  122.02    0.00   13.00   0.00   0.08     0.00     0.05    0.75      0.00      8.12   0.06  100.00
dm-0     0.00  15407.00   0.00  122.02    0.00    0.00   0.00   0.00     0.00     0.06    0.98      0.00      8.11   0.06  100.00

So maybe that's just a hint that you should try on slower storage instead?
(I think on NVMe this issue would be hardly noticeable due to the low I/O
latency, unlike here.)

-J.

[1] - https://www.spinics.net/lists/linux-xfs/msg73035.html
