automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: automatic analyze: readahead - add "IO read time" log message
Date: 2020-10-26 12:21:01
Message-ID: VI1PR0701MB6960361E83FE46E71FBC8811F6190@VI1PR0701MB6960.eurprd07.prod.outlook.com

Greetings hackers,

I have what I hope is an interesting observation (and a nano patch proposal) from a system where statistics freshness is a critical factor. Autovacuum/automatic statistics gathering was tuned to be very aggressive:
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001 and autovacuum_analyze_threshold=50000 to force auto-analyze jobs to run quite often. The interesting log entries look like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

and this is the root cause of my question. As you can see, there is a huge 3x-4x discrepancy between "elapsed" and user+system, which is strange, at least to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to various tools that is indeed the case: the time was spent somewhere else, but not inside the PostgreSQL analyze itself. ps(1) and pidstat(1) report the same for the worker:

06:56:12 AM PID %usr %system %guest %CPU CPU Command
06:56:13 AM 114774 8.00 10.00 0.00 18.00 18 postgres
06:56:14 AM 114774 8.00 11.00 0.00 19.00 15 postgres
06:56:15 AM 114774 5.00 13.00 0.00 18.00 18 postgres

06:56:17 AM PID kB_rd/s kB_wr/s kB_ccwr/s Command
06:56:18 AM 114774 63746.53 0.00 0.00 postgres
06:56:19 AM 114774 62896.00 0.00 0.00 postgres
06:56:20 AM 114774 62920.00 0.00 0.00 postgres

One could argue that such an autoanalyze worker could perform 5x more work here (%CPU -> 100%). The I/O subsystem is not doing much in total (242 MB/s reads @ 22k IOPS, 7 MB/s writes @ 7k IOPS, with 0.04 ms service time), and although it reports high utilization I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80 MB/s analyzes running in parallel (let's say 300 MB/s for statistics collection alone).

According to various gdb backtraces, a lot of time is spent here:
#0 0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1 0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2 FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3 0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4 0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f)
at bufmgr.c:897
#5 0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
#6 0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7 0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
#8 acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
#9 0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false,
in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million and that the blocks are accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984
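
For illustration, here is a minimal, self-contained sketch (not the actual analyze.c code) of the kind of block sampling acquire_sample_rows() performs: candidate blocks are considered in order 0..nblocks-1 and selected without replacement, which is why the blockno values above come out strictly increasing.

/*
 * Pick "wanted" blocks out of "nblocks" without replacement, in
 * increasing order (Knuth's Algorithm S).  Each remaining block is
 * accepted with probability needed/remaining, so exactly "wanted"
 * blocks are chosen and they are emitted sorted.
 */
#include <stdio.h>
#include <stdlib.h>

static void
sample_blocks_sorted(unsigned long nblocks, unsigned long wanted)
{
	unsigned long selected = 0;

	for (unsigned long blk = 0; blk < nblocks && selected < wanted; blk++)
	{
		unsigned long remaining = nblocks - blk;
		unsigned long needed = wanted - selected;

		if ((double) rand() / ((double) RAND_MAX + 1.0) < (double) needed / (double) remaining)
		{
			printf("sample block %lu\n", blk);
			selected++;
		}
	}
}

int
main(void)
{
	srand(42);
	sample_blocks_sorted(26763525, 20);	/* relpages from the backtrace above, 20 samples for the demo */
	return 0;
}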

This probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in the readaheads done by ext4 (trimmed for clarity):

# Children Self sys usr Command Shared Object Symbol
63.64% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
---entry_SYSCALL_64_after_hwframe
do_syscall_64
|--59.66%--sys_pread64
| vfs_read
| --59.09%--__vfs_read
| --58.24%--generic_file_read_iter
| |--47.44%--ondemand_readahead
| | --46.88%--__do_page_cache_readahead
| | |--32.67%--ext4_mpage_readpages
| | | |--16.76%--submit_bio
| | |--10.23%--blk_finish_plug
[..]
63.64% 1.99% 1.99% 0.00% postgres [kernel.kallsyms] [k] do_syscall_64
|--61.65%--do_syscall_64
| |--59.66%--sys_pread64
| | vfs_read
| | --59.09%--__vfs_read
| | --58.24%--generic_file_read_iter
| | |--47.44%--ondemand_readahead
| | | --46.88%--__do_page_cache_readahead

61.36% 0.00% 0.00% 0.00% postgres postgres [.] FileRead
---FileRead
__pread_nocancel
--60.51%--entry_SYSCALL_64_after_hwframe
do_syscall_64
--59.66%--sys_pread64
vfs_read
--59.09%--__vfs_read
--58.24%--generic_file_read_iter
|--47.44%--ondemand_readahead
| --46.88%--__do_page_cache_readahead

61.36% 0.85% 0.00% 0.85% postgres libpthread-2.17.so [.] __pread_nocancel
|--60.51%--__pread_nocancel
| entry_SYSCALL_64_after_hwframe
| do_syscall_64
| --59.66%--sys_pread64
| vfs_read
| --59.09%--__vfs_read
| --58.24%--generic_file_read_iter
| |--47.44%--ondemand_readahead
| | --46.88%--__do_page_cache_readahead

59.66% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k] sys_pread64
---sys_pread64
vfs_read
--59.09%--__vfs_read
--58.24%--generic_file_read_iter
|--47.44%--ondemand_readahead
| --46.88%--__do_page_cache_readahead
| |--32.67%--ext4_mpage_readpages

[..]
perf --no-children confirms yet again that there isn't any single function burning a lot of CPU inside the worker:

# Overhead sys usr Command Shared Object Symbol
5.40% 0.00% 5.40% postgres [vdso] [.] __vdso_clock_gettime
5.11% 0.00% 5.11% postgres postgres [.] acquire_sample_rows
---acquire_sample_rows
3.98% 0.00% 3.98% postgres postgres [.] heapam_scan_analyze_next_tuple
---heapam_scan_analyze_next_tuple
3.69% 3.69% 0.00% postgres [kernel.kallsyms] [k] pvclock_clocksource_read

My questions are:
a) Does anybody know whether it is expected that getrusage() doesn't account readahead time as the current thread's system time? (I don't know; maybe the readahead is performed by other kernel threads?) ru_stime is defined as "This is the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here (waiting != executing)?
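
As a quick standalone check of a) (not PostgreSQL code; /tmp/bigfile is just a placeholder for any large file that is not in the page cache), one can read a file sequentially with pread() and compare the wall-clock time against getrusage(): in my understanding, time spent blocked on (readahead) I/O is not charged to ru_stime, so "elapsed" should come out far larger than user+system, exactly like the autoanalyze log lines above.

#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	char			buf[8192];
	struct rusage	ru;
	struct timespec t0, t1;
	off_t			off = 0;
	ssize_t			n;
	int				fd = open("/tmp/bigfile", O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while ((n = pread(fd, buf, sizeof(buf), off)) > 0)	/* 8kB reads, like the backend */
		off += n;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	getrusage(RUSAGE_SELF, &ru);

	printf("elapsed: %.2f s, user: %.2f s, system: %.2f s\n",
		   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
		   ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
		   ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
	close(fd);
	return 0;
}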

b) Initially I wanted to add a new pg_rusage_show_verbose() that would also report ru_inblock, but that wouldn't add much value for the end user. Adding another timing directly around table_scan_analyze_next_block() also seems like a bad idea, as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as strictly the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL14-devel, under heavy I/O conditions (with track_io_timing=on), logs the following:
"LOG: automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s, system: 0.13 s, elapsed: 0.92 s"
My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* time on the kernel side was just 0.13 s. It could give the end user/DBA the information needed to see where the bottleneck is, given autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0, maybe it would also make sense to include the time spent sleeping?
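
To be concrete, here is a hedged sketch of the kind of change I mean (the attached patch is the authoritative version and may differ in details); it relies on pgStatBlockReadTime, the microsecond counter maintained via ReadBuffer_common()/pgstat when track_io_timing is on, and extends the existing ereport() in do_analyze_rel():

	PgStat_Counter startreadtime = pgStatBlockReadTime;	/* snapshot before sampling */

	/* ... acquire_sample_rows() and statistics computation run here ... */

	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
		ereport(LOG,
				(errmsg("automatic analyze of table \"%s.%s.%s\" system usage: IO read time %.2f s, %s",
						get_database_name(MyDatabaseId),
						get_namespace_name(RelationGetNamespace(onerel)),
						RelationGetRelationName(onerel),
						(double) (pgStatBlockReadTime - startreadtime) / 1000000.0,
						pg_rusage_show(&ru0))));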

c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readahead. E.g. would disabling readahead help the PostgreSQL analyze.c use case on NVMe, given that only a small fraction of the blocks is actually needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce parallelism and spread the work across even more autovacuum workers). Any hints or comments?
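
To make c) concrete, here is a standalone illustration (not PostgreSQL code) of the "disable readahead" idea: posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) tells the kernel to expect random access on that file descriptor, which effectively switches off readahead, so a sampled scan would only pull in the 8kB blocks it actually asks for. Whether that is a win on NVMe is exactly the open question.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char	buf[8192];
	off_t	offsets[] = {(off_t) 1000 * 8192, (off_t) 50000 * 8192};
	int		fd, rc;

	if (argc < 2)
	{
		fprintf(stderr, "usage: %s <relation segment file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* hint a random access pattern -> kernel skips readahead for this fd */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

	/* read two widely spaced 8kB blocks, as a sampled ANALYZE would */
	for (int i = 0; i < 2; i++)
		if (pread(fd, buf, sizeof(buf), offsets[i]) < 0)
			perror("pread");

	close(fd);
	return 0;
}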

All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14 with ext4 on a striped dm over 3x-4x NVMe drives.

-Jakub Wartak.

Attachment: 0001-do_analyze_rel_ioreadtimings_patch.c (text/plain, 1.3 KB)
