From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-06 09:49:06
Message-ID: 1f22794321b745549d54359d343e37b8@oss.nttdata.com
Lists: pgsql-hackers
On Tue, Dec 31, 2024 at 1:39 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that. Maybe there are filesystems
> where a read that we request maps one-to-one with a subsequent
> page fault, but it hardly seems likely to me that that's
> universal. Also, you can't tell page faults for reading program
> code apart from those for data, and you won't get any information
> at all about writes.
Thanks for the explanation.
On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
>> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>> I certainly would love to see storage I/O numbers as distinct from
>>> kernel read I/O numbers.
>>
>> Me too, but I think it is 100% wishful thinking to imagine that
>> page fault counts match up with that.
>
> Okay I played around with this patch a bit, in hopes of proving you
> wrong. But I now agree with you. I cannot seem to get any numbers out
> of this that make sense.
>
> The major page fault numbers are always zero, even after running:
>
> echo 1 > /proc/sys/vm/drop_caches
>
> If Takahori has a way to get some more useful insights from this patch,
> I'm quite interested in the steps he took (I might very well have
> missed something obvious).
Thanks for testing.
I also did pg_ctl restart to clear the buffer cache in addition to your
step, and saw many major faults again.
However, when I replaced the restart with pg_buffercache_evict(), I
observed too few major faults.
I now feel majflt from getrusage() is not an appropriate metric for
measuring storage I/O.
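
For reference, what is being counted here is just the getrusage(2)
fault fields; a minimal standalone sketch of the measurement (plain C,
not the patch itself):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    /* ... run the workload being measured ... */
    getrusage(RUSAGE_SELF, &after);

    /* ru_majflt counts faults that required real I/O, but as Tom noted
     * it also includes faults for program text and says nothing about
     * writes. */
    printf("major faults: %ld\n", after.ru_majflt - before.ru_majflt);
    printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);
    return 0;
}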
> **However, I think the general direction has merit**: Changing this
> patch to use `ru_inblock`/`ru_oublock` gives very useful insights.
> `ru_inblock` is 0 when everything is in page cache, and it is very
> high when stuff is not. I was only hacking around and basically did
> this:
>
> s/ru_minflt/ru_inblock/g
> s/ru_majflt/ru_oublock/g
Great!
I had misunderstood these metrics as including page-cached I/O.
As far as I inspected, they come from read_bytes/write_bytes of
task_io_accounting, and the comments suggest they are what we want,
i.e. storage I/O:
--
/usr/src/linux-headers-5.15.0-127/include/linux/task_io_accounting.h

struct task_io_accounting {
..(snip)..
#ifdef CONFIG_TASK_IO_ACCOUNTING
	/*
	 * The number of bytes which this task has caused to be read from
	 * storage.
	 */
	u64 read_bytes;

	/*
	 * The number of bytes which this task has caused, or shall cause to be
	 * written to disk.
	 */
	u64 write_bytes;
--
> Obviously more is needed. We'd probably want to show these numbers in
> useful units like MB or something. Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).
Updated the PoC patch to report them in KB:
=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035 width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
   Buffers: shared hit=2587 read=161348
 Planning Time: 0.367 ms
 Execution:
   Storage I/O: read=1291856 KB write=0 KB
 Execution Time: 4353.253 ms
(6 rows)
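
The conversion is straightforward; roughly like below (a sketch of the
idea, not the patch code, assuming the Linux behavior that
ru_inblock/ru_oublock are counted in 512-byte units, so blocks / 2 = KB):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    /* ... run the query ... */
    getrusage(RUSAGE_SELF, &after);

    /* On Linux these fields count 512-byte blocks, so halving the
     * delta yields kilobytes. */
    printf("Storage I/O: read=%ld KB write=%ld KB\n",
           (after.ru_inblock - before.ru_inblock) / 2,
           (after.ru_oublock - before.ru_oublock) / 2);
    return 0;
}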
> Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).
Maybe, but as long as we use getrusage(), ru_inblock and ru_oublock seem
the best options.
> One other thing that I noticed when playing around with this, which
> would need to be addressed: Parallel workers need to pass these values
> to the main process somehow, otherwise the IO from those processes gets
> lost.
Yes.
I haven't implemented it yet, but I believe we can pass them the same
way buffer/WAL usage is passed.
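
Something along these lines, mirroring how BufferUsage/WalUsage deltas
are accumulated from parallel workers (all names below are assumptions
for illustration, not actual patch code):

/* Hypothetical counters, in the same 512-byte units as getrusage(). */
typedef struct StorageIOUsage
{
	long		inblock;	/* blocks read from storage */
	long		oublock;	/* blocks written to storage */
} StorageIOUsage;

/*
 * Leader adds each worker's (end - start) delta, the same pattern used
 * for BufferUsage/WalUsage accumulation.
 */
static void
StorageIOUsageAccumDiff(StorageIOUsage *dst,
						const StorageIOUsage *add,
						const StorageIOUsage *sub)
{
	dst->inblock += add->inblock - sub->inblock;
	dst->oublock += add->oublock - sub->oublock;
}

/* Each worker would snapshot getrusage() at start and end, publish the
 * delta through shared memory, and the leader would sum the entries
 * after the workers finish. */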
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachment: v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch (text/x-diff, 10.5 KB)