From: | Tobias Oberstein <tobias(dot)oberstein(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: lseek/read/write overhead becomes visible at scale .. |
Date: | 2017-01-24 18:25:52 |
Message-ID: | a55b21d1-7c99-2c66-d661-ef5288f29e30@gmail.com |
Lists: | pgsql-hackers |
Hi,
>> pid | syscall | cnt | cnt_per_sec
>> -----+---------------------------------------+---------+-------------
>> | syscalls:sys_enter_lseek | 4091584 | 136386
>> | syscalls:sys_enter_newfstat | 2054988 | 68500
>> | syscalls:sys_enter_read | 767990 | 25600
>> | syscalls:sys_enter_close | 503803 | 16793
>> | syscalls:sys_enter_newstat | 434080 | 14469
>> | syscalls:sys_enter_open | 380382 | 12679
>>
>> Note: there isn't a lot of load currently (this is from production).
>
> That doesn't really mean that much - sure it shows that lseek is
> frequent, but it doesn't tell you how much impact this has to the
Above is on a mostly idle system ("idle" for our loads) .. when things
get hot, lseek calls can reach into the millions/sec.
Doing 5 million syscalls per sec comes with overhead no matter how
lightweight the syscall is, doesn't it?
Using pread instead of lseek+read halves the number of syscalls.
I really don't understand what you are fighting here ..
> overall workload. For that'd you'd need a generic (i.e. not syscall
> tracepoint, but cpu cycle) perf profile, and look in the call graph (via
> perf report --children) how much of that is below the lseek syscall.
I see. I might find time to extend our helper function f_perf_syscalls.
>>>>> I'm much less against this change than Tom, but doing artificial syscall
>>>>> microbenchmark seems unlikely to make a big case for using it in
>>>>
>>>> This isn't a syscall benchmark, but FIO.
>>>
>>> There's not really a difference between those, when you use fio to
>>> benchmark seek vs pseek.
>>
>> Sorry, I don't understand what you are talking about.
>
> Fio as you appear to have used is a microbenchmark benchmarking
> individual syscalls.
I am benchmarking IOPS, and while doing so, it becomes apparent that at
these scales it does matter _how_ IO is done.
The most efficient way is libaio. I get 9.7 million/sec IOPS with low
CPU load. Using any synchronous IO engine is slower and produces higher
load.
I do understand that switching to libaio isn't going to fly for PG
(completely different approach). But doing pread instead of lseek+read
seems simple enough. But then, I don't know about the PG codebase ..
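To make the comparison concrete, here is a minimal sketch in C of the two read paths. This is not PG code: the function names (read_block_lseek, read_block_pread, demo_roundtrip) and the temp file path are made up for illustration, and only the POSIX calls themselves (lseek, read, pread, mkstemp) are real.

```c
#define _XOPEN_SOURCE 700   /* expose pread() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls per block: move the file offset, then read from it. */
ssize_t read_block_lseek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One syscall per block: the offset is passed as an argument, and the
 * file offset of fd is left untouched. */
ssize_t read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}

/* Hypothetical self-check: write a small pattern to a temp file and
 * confirm that both paths return the same bytes. Returns 0 on success. */
int demo_roundtrip(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    const char pattern[] = "0123456789abcdef";
    char a[4], b[4];
    int ok;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                   /* file disappears once fd is closed */
    assert(sizeof a == sizeof b);   /* both buffers hold one "block" */

    ok = write(fd, pattern, sizeof pattern) == (ssize_t) sizeof pattern
      && read_block_lseek(fd, a, sizeof a, 10) == (ssize_t) sizeof a
      && read_block_pread(fd, b, sizeof b, 10) == (ssize_t) sizeof b
      && memcmp(a, b, sizeof a) == 0
      && memcmp(a, "abcd", sizeof a) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```

Both helpers return the same data for the same (fd, offset, len); the only difference is that the first enters the kernel twice per block and the second once. Whether that difference is measurable in a real PG workload is exactly what a cycle-level profile would have to show.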
Among the synchronous methods of doing IO, psync is much better than sync.
pvsync, pvsync2 and pvsync2 + hipri (busy polling, no interrupts) are
better, but the gain is smaller, and all of them are inferior to libaio.
>>> Glad to hear it.
>>
>> With 3TB RAM, huge pages is absolutely essential (otherwise, the system bogs
>> down in TLB etc overhead).
>
> I was one of the people working on adding hugepage support to pg, that's
> why I was glad ;)
Ahh;) Sorry, wasn't aware. This is really invaluable. Thanks for that!
Cheers,
/Tobias