Re: pgcon unconference / impact of block size on performance

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: pgcon unconference / impact of block size on performance
Date:	2022-06-08 14:51:41
Message-ID:	a568f335-4197-ec60-58ec-9f49f9ebe4b4@enterprisedb.com
Lists:	pgsql-hackers

On 6/8/22 16:15, Jakub Wartak wrote:
> Hi, got some answers!
>
> TL;DR: for fio it would make sense to use many stress files (instead of 1), and likewise numjobs ~ VCPU, to avoid various pitfalls.
>
>>>> The really
>>>> puzzling thing is why is the filesystem so much slower for smaller
>>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
>>>> Why would a filesystem have such effect?
>>>
>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
>>> many real world scenarios ;)
> [..]
>> Independently of that, it seems like an interesting behavior and it might tell us
>> something about how to optimize for larger pages.
>
> OK, curiosity won:
>
> With randwrite on ext4 directio using 4kB the avgqu-sz reaches ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128]
> With randwrite on ext4 directio using 1kB the avgqu-sz is just 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] -> conclusion: something is being locked, thus preventing the queue from building up
> With randwrite on ext4 directio using 4kB the avgqu-sz reaches ~2.3 (so something is queued) and I'm also getting ~70k IOPS with minimal possible maxdepth=4 -> conclusion: I just need to split the lock contention by 4.
>
> The top function in the 1kB (slow) profile is aio_write() -> .... -> iov_iter_get_pages() -> internal_get_user_pages_fast(), and sadly there are plenty of "lock" keywords inside {related to memory manager, padding to full page size, inode locking}. One can also find some articles / commits related to it [1], which didn't give me a good feeling to be honest, as fio is using just 1 file (even though I'm on kernel 5.10.x). So I switched to 4x files and numjobs=4 and easily got 60k IOPS, contention solved, whatever it was :) So I would assume PostgreSQL (with its splitting of data files on 1GB boundaries by default and its multiprocess architecture) should be relatively safe from such ext4 inode(?)/mm(?) contention, even with the smallest 1kB block sizes on Direct I/O some day.
>
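
A rough way to read the quoted iostat numbers is Little's law (mean queue length = throughput x mean time in the queue). The sketch below only plugs in the figures from the message above; the midpoints picked for the quoted ranges (95 for ~90-100, 19,500 for ~17-22k) are assumptions, and the function name is hypothetical.

```python
def implied_latency_us(avg_queue_size: float, iops: float) -> float:
    """Little's law rearranged: W = L / lambda, returned in microseconds."""
    return avg_queue_size / iops * 1_000_000


# (avgqu-sz, IOPS) taken from the message; range midpoints are assumptions.
runs = {
    "4kB, depth 128": (95.0, 70_000),
    "1kB, depth 128": (0.7, 19_500),
    "4kB, depth 4": (2.3, 70_000),
}

for name, (queue, iops) in runs.items():
    latency = implied_latency_us(queue, iops)
    print(f"{name}: ~{latency:.0f} us per request in the block-layer queue")
```

Under these assumptions the 1kB run spends only tens of microseconds per request in the queue while the queue itself stays nearly empty, which is consistent with the requests being held up before they reach the device rather than by the device.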

Interesting. So what parameter values would you suggest?

FWIW some of the tests I did were on xfs, so I wonder if that might be
hitting similar/other bottlenecks.
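
For illustration only, here is a minimal sketch of a fio invocation along the lines described in the quoted message (4 files, numjobs=4, direct I/O, 1kB random writes). Only numjobs=4, the block size, and direct=1 come from the thread; the libaio engine, per-job size, runtime, job name, and the /mnt/test directory are assumptions. When --directory is given without --filename, fio creates a separate file per job, which is what gives one file per writer here.

```python
import shlex


def fio_randwrite_cmd(directory: str, block_size: str = "1k", jobs: int = 4,
                      iodepth: int = 128, size: str = "1g") -> list[str]:
    """Build a fio command line with one job (and thus one file) per writer."""
    return [
        "fio",
        "--name=stress",              # hypothetical job name
        f"--directory={directory}",   # no --filename: fio names one file per job
        "--rw=randwrite",
        f"--bs={block_size}",
        "--direct=1",
        "--ioengine=libaio",          # assumption: engine is not stated in the thread
        f"--iodepth={iodepth}",       # queue depth is per job
        f"--numjobs={jobs}",
        f"--size={size}",             # assumption: size is per job
        "--time_based",
        "--runtime=60",               # assumption
        "--group_reporting",
    ]


print(shlex.join(fio_randwrite_cmd("/mnt/test")))
```

The same builder can be reused for xfs by pointing the directory at an xfs mount, so both filesystems are compared with identical parameters.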

> [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
>
>>> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
>>> how having direct I/O enabled (which disables caching) achieves better read
>>> IOPS on ext4 than on raw device... isn't it contradiction?
>>>
>>
>> Thanks for the clarification. Not sure what might be causing this. Did you use the
>> same parameters (e.g. iodepth) in both cases?
>
> Explanation: it's CPU scheduler migrations muddying the performance results during the fio runs (as you have in your framework). Various VCPUs seem to have varying max IOPS characteristics (sic!) and the CPU scheduler seems to be unaware of it. This happens at least with 1kB and 4kB block sizes. Also notice that some VCPUs [XXXX marker] don't reach 100% CPU while reaching almost twice the result, whereas cores 0 and 3 do reach 100% and lack the CPU power to do more. The only thing I don't get is that this doesn't make sense given the extended lscpu output (but maybe it's AWS XEN mixing real CPU mappings, who knows).

Uh, that's strange. I haven't seen anything like that, but I'm running
on physical HW and not AWS, so it's either that or maybe I just didn't
do the same test.
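
One way to take scheduler migrations out of such a comparison would be to pin each fio job to its own CPU. The sketch below builds the relevant fio options (cpus_allowed with the "split" policy, which gives each job one CPU from the set); the helper name and the CPU numbers are hypothetical, and nothing in the thread says this was actually tried.

```python
def fio_pinning_args(cpus: list[int]) -> list[str]:
    """Return fio options that give each job its own CPU from `cpus`."""
    unique = sorted(set(cpus))
    cpu_list = ",".join(str(c) for c in unique)
    return [
        f"--cpus_allowed={cpu_list}",
        "--cpus_allowed_policy=split",   # one CPU per job instead of a shared set
        f"--numjobs={len(unique)}",      # one job per listed CPU
    ]


# Example: compare a run pinned to CPUs 0 and 3 with one pinned to CPUs 1 and 2.
print(fio_pinning_args([0, 3]))
print(fio_pinning_args([1, 2]))
```

Running the same job once per CPU set would show whether the per-VCPU differences persist when migrations are ruled out.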

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

RE: pgcon unconference / impact of block size on performance at 2022-06-08 14:15:17 from Jakub Wartak

Responses

RE: pgcon unconference / impact of block size on performance at 2022-06-09 11:23:36 from Jakub Wartak