Re: pgcon unconference / impact of block size on performance

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pgcon unconference / impact of block size on performance
Date: 2022-06-09 22:24:29
Message-ID: 07b8a5d9-ed76-a494-9998-4142aea92259@enterprisedb.com
Lists: pgsql-hackers

On 6/9/22 13:23, Jakub Wartak wrote:
>>>>>> The really puzzling thing is why is the filesystem so much slower for
>>>>>> smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
>>>>>> Why would a filesystem have such effect?
>>>>>
>>>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
>>>>> many real world scenarios ;)
>>> [..]
>>>> Independently of that, it seems like an interesting behavior and it
>>>> might tell us something about how to optimize for larger pages.
>>>
>>> OK, curiosity won:
>>>
>>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
>>> (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
>>> maxdepth=128]
>>>
>>> With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and
>>> I'm getting just ~17-22k IOPS [with maxdepth=128] -> conclusion:
>>> something is being locked thus preventing queue to build up
>>>
>>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so
>>> something is queued) and I'm also getting ~70k IOPS with minimal possible
>>> maxdepth=4 -> conclusion: I just need to split the lock contention by 4.
>>>
>>> The 1kB (slow) profile top function is aio_write() -> .... ->
>>> iov_iter_get_pages() -> internal_get_user_pages_fast() and there's sadly
>>> plenty of "lock" keywords inside {related to memory manager, padding to
>>> full page size, inode locking}. One can also find some articles / commits
>>> related to it [1], which didn't give me a good feeling to be honest, as
>>> fio is using just 1 file (even though I'm on kernel 5.10.x). So I've
>>> switched to 4x files and numjobs=4 and easily got 60k IOPS, contention
>>> solved whatever it was :) So I would assume PostgreSQL (with its
>>> splitting data files by default on 1GB boundaries and multiprocess
>>> architecture) should be relatively safe from such ext4 inode(?)/mm(?)
>>> contentions even with smallest 1kb block sizes on Direct I/O some day.
>>>
>>
>> Interesting. So what parameter values would you suggest?
>
> At least have 4x filename= entries and numjobs=4
>
>> FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
>> similar/other bottlenecks.
>
> Apparently XFS also shows same contention on single file for 1..2kb randwrite, see [ZZZ].
>

I don't have any results yet, but after thinking about this a bit I find
this really strange. Why would there be any contention with a single fio
job? Doesn't contention imply multiple processes competing for the same
resource/lock etc.?

Isn't this simply due to the iodepth increase? IIUC with multiple fio
jobs, each will use a separate iodepth value. So with numjobs=4, we'll
really use iodepth*4, which can make a big difference.
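
To make the arithmetic explicit, here is a trivial sketch (Python, purely
illustrative). It assumes what I said above, namely that each fio job keeps
its own queue of up to iodepth requests; the function name is made up for
this example and is not part of fio.

```python
def total_outstanding_ios(iodepth: int, numjobs: int) -> int:
    """Upper bound on in-flight I/Os across all fio jobs.

    Assumption: every job maintains its own queue of up to `iodepth`
    requests, so the device may see up to iodepth * numjobs at once.
    """
    return iodepth * numjobs


# Compare a single job against the 4-job setup discussed in this thread.
for numjobs in (1, 4):
    total = total_outstanding_ios(128, numjobs)
    print(f"iodepth=128 numjobs={numjobs} -> up to {total} I/Os in flight")
```

So a fair comparison against numjobs=4 would either raise iodepth for the
single job, or divide iodepth by 4 for the multi-job run.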

>>>
>>> Explanation: it's the CPU scheduler migrations mixing the performance
>>> result during the runs of fio (as you have in your framework). Various
>>> VCPUs seem to have varying max IOPS characteristics (sic!) and the CPU
>>> scheduler seems to be unaware of it. This happens at least on 1kB and 4kB
>>> blocksize. Also notice that some VCPUs [XXXX marker] don't reach 100% CPU
>>> while reaching almost twice the result, while cores 0, 3 do reach 100%
>>> and lack CPU power to perform more. The only thing that I don't get is
>>> that it doesn't make sense from the extended lscpu output (but maybe it's
>>> AWS XEN mixing real CPU mappings, who knows).
>>
>> Uh, that's strange. I haven't seen anything like that, but I'm running on physical
>> HW and not AWS, so it's either that or maybe I just didn't do the same test.
>
> I couldn't believe it until I checked via taskset 😊 BTW: I don't
> have real HW with NVMe, but it might be worth checking whether
> placing fio (taskset -c ...) on a hyperthreading VCPU is causing this
> (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and
> maybe lscpu(1) output). On AWS I have a feeling that lscpu might simply
> lie and I cannot identify which VCPU is HT and which isn't.

Did you see the same issue with io_uring?
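
Regarding the HT question, here is a rough sketch (Python) for listing which
CPUs the kernel reports as thread siblings. It reads thread_siblings_list,
the human-readable companion of the thread_siblings mask mentioned above.
The helper names are made up for this example, and whether the result can be
trusted on a virtualized AWS instance is exactly the open question.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list such as "0,4" or "0-1,8-9" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # A range like "0-3" covers both endpoints.
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root: str = "/sys/devices/system/cpu") -> set[tuple[int, ...]]:
    """Collect the distinct sibling groups reported under sysfs_root."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return groups


if __name__ == "__main__":
    # A group with more than one CPU means those CPUs share a physical core.
    for group in sorted(sibling_groups()):
        print(group)
```

Pinning fio with taskset -c to one CPU from each group, and then to both
CPUs of a single group, should show whether sibling placement explains the
difference.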

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
