From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | open/lseek overhead with many partitions (including with "faster partitioning pruning") |
Date: | 2018-04-03 19:46:13 |
Message-ID: | 20180403194613.GY28454@telsasoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".
One of our customers (with the largest number of child tables, currently of
relkind='r') takes a long time to plan query, and I wondered whether we'd
benefit from ongoing work, if I should plan to convert to relkind='p', and if I
could start testing that for our use cases.
I found by stracing that the backend is open()ing + lseek(SEEK_END) each file
in each relation being UNIONed in this view, including relations which were
excluded by constraints. Their tables are partitioned by day (some are still
40+GB), so there's only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but accessing a LOT more files
than needed:
cold backend:
[pryzbyj(at)database ~]$ grep open /tmp/strace-pg12-0|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
26593 open("pg_tblspc/16400
6894 open("pg_tblspc/16401
482 open("base/16402
warm backend:
[pryzbyj(at)database ~]$ grep open /tmp/strace-pg12-1|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
6545 open("pg_tblspc/16401
1338 open("pg_tblspc/16400
254 open("base/16402
I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'. I applied v47 patches for "path toward faster
partition pruning", and was able to quickly create synthetic "huge table" using
truncate (and verified that smgr logic actually opens all 1e5 files).
$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY range(i)"
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t$a PARTITION OF t FOR VALUES FROM ($a)TO($((1+a)))"; done
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 1G $a; done # make existing, empty files 1GB..
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in `seq 1 99`; do truncate -s 1G $a.$b; done; done # make new 1GB tables with .n extension
postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
Limit (cost=0.00..0.01 rows=1 width=8)
-> Aggregate (cost=0.00..0.01 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=0 width=0)
One-Time Filter: false
Time: 63268.474 ms (01:03.268)
=> zfs/fuse is certainly exacerbating the issue here due to context switches; but..
[pryzbyj(at)database postgresql]$ time cut -d\( -f1 /tmp/strace-pg11dev-faster-15|sort |uniq -c |sort -nr |head
100899 open
99900 lseek
98918 close
118 brk
postgres is still open()+lseek() on 100k files, most of which are for relations
which were clearly excluded:
(gdb) bt
#0 0x00000032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1 0x0000000000711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:965
#2 0x00000000007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:1394
#3 0x0000000000733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, segno=10, oflags=0) at md.c:1783
#4 0x0000000000734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at md.c:918
#5 0x00000000006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, attr_widths=0x1e2cd24, pages=<value optimized out>, tuples=0x1e2cb98, allvisfrac=<value optimized out>) at plancat.c:946
#6 0x00000000006b3a47 in get_relation_info (root=0x1bd9030, relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at plancat.c:144
#7 0x00000000006b78fa in build_simple_rel (root=0x1bd9030, relid=126, parent=0x1bda578) at relnode.c:185
#8 0x00000000006b7990 in build_simple_rel (root=0x1bd9030, relid=1, parent=0x0) at relnode.c:251
#9 0x0000000000691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=<value optimized out>) at initsplan.c:121
#10 0x00000000006924af in query_planner (root=<value optimized out>, tlist=0x1d34fd8, qp_callback=0x693940 <standard_qp_callback>, qp_extra=0x7ffe0a077290) at planmain.c:138
#11 0x000000000069777e in grouping_planner (root=<value optimized out>, inheritance_update=false, tuple_fraction=<value optimized out>) at planner.c:1892
#12 0x0000000000698ef7 in subquery_planner (glob=<value optimized out>, parse=<value optimized out>, parent_root=<value optimized out>, hasRecursion=<value optimized out>, tuple_fraction=0) at planner.c:966
#13 0x0000000000699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, boundParams=<value optimized out>) at planner.c:405
#14 0x0000000000739d2a in pg_plan_query (querytree=<value optimized out>, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at postgres.c:808
#15 0x00000000005a0bd6 in ExplainOneQuery (query=0x1bd9778, cursorOptions=<value optimized out>, into=0x0, es=0x1bd8f58, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0)
at explain.c:365
#16 0x00000000005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
#17 0x00000000007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1bd8e90,
completionTag=0x7ffe0a077720 "") at utility.c:670
#18 0x000000000073c789 in PortalRunUtility (portal=0x1b46780, pstmt=0x1ae2050, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x1bd8e90, completionTag=<value optimized out>) at pquery.c:1178
#19 0x000000000073d9aa in FillPortalStore (portal=0x1b46780, isTopLevel=true) at pquery.c:1038
#20 0x000000000073de80 in PortalRun (portal=0x1b46780, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ae2618, altdest=0x1ae2618, completionTag=0x7ffe0a077920 "") at pquery.c:768
#21 0x000000000073a39d in exec_simple_query (query_string=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;") at postgres.c:1121
#22 0x000000000073b341 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1b0ac80 "postgres", username=<value optimized out>) at postgres.c:4149
#23 0x00000000006cd4f3 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4409
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4081
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1754
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1362
#27 0x00000000006477d0 in main (argc=7, argv=0x1adba20) at main.c:228
I haven't been following the development patch, so don't know if that's
expected to be improved, or a possible future improvement, or ...
Also, I wondered if it'd be preferable for mdnblocks/estimate_rel_size to
l/stat() rather than open+lseek, which can require opening large number of
rarely-queried files and mess up fd.c LRU cache.
Justin
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2018-04-03 19:50:55 | Re: WIP: BRIN multi-range indexes |
Previous Message | Alvaro Herrera | 2018-04-03 19:11:35 | Re: Foreign keys and partitioned tables |