From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: BitmapHeapScan streaming read user and prelim refactoring |
Date: | 2024-03-18 11:34:23 |
Message-ID: | 059b1665-38aa-450f-a7a0-0bdd37104069@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 3/17/24 20:36, Tomas Vondra wrote:
>
> ...
>
>> Besides a lot of other things, I finally added debugging fprintfs printing the
>> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
>> large amount of output that generates shows that two iterators were out of
>> sync.
>>
>
> Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
> was hoping you came up with some smart way to trace this king of stuff.
> For example I was wondering if ebpf would be a more convenient way.
>
FWIW I just realized why I failed to identify this "late prefetch" issue
during my investigation. I was experimenting with instrumenting this by
adding a LD_PRELOAD library, logging all pread/fadvise calls. But the
FilePrefetch call is skipped in the page is already in shared buffers,
so this case "disappeared" during processing which matched the two calls
by doing an "inner join".
That being said, I think tracing this using LD_PRELOAD or perf may be
more convenient way to see what's happening. For example I ended up
doing this:
perf record -a -e syscalls:sys_enter_fadvise64 \
-e syscalls:sys_exit_fadvise64 \
-e syscalls:sys_enter_pread64 \
-e syscalls:sys_exit_pread64
perf script -ns
Alternatively, perf-trace can be used and prints the filename too (but
time has ms resolution only). Processing this seems comparable to the
fprintf approach.
It still has the issue that some of the fadvise calls may be absent if
the prefetch iterator gets too far behind, but I think that can be
detected / measured by simply counting the fadvise calls, and comparing
them to pread calls. We expect these to be about the same, so
(#pread - #fadvise) / #fadvise
is a measure of how many were "late" and skipped.
It also seems better than fprintf because it traces the actual syscalls,
not just calls to glibc wrappers. For example I saw this
postgres 54769 [001] 33768.771524828:
syscalls:sys_enter_pread64: ..., pos: 0x30d04000
postgres 54769 [001] 33768.771526867:
syscalls:sys_exit_pread64: 0x2000
postgres 54820 [000] 33768.771527473:
syscalls:sys_enter_fadvise64: ..., offset: 0x30d04000, ...
postgres 54820 [000] 33768.771528320:
syscalls:sys_exit_fadvise64: 0x0
which is clearly a case where we issue fadvise after pread of the same
block already completed.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Langote | 2024-03-18 11:35:18 | Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning |
Previous Message | Alexander Korotkov | 2024-03-18 11:24:47 | Re: Add pg_basetype() function to obtain a DOMAIN base type |