From: | Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Nested loop vs merge join: inconsistencies between estimated and actual time |
Date: | 2008-03-07 05:50:40 |
Message-ID: | 47D0D7B0.7060109@dc.baikal.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
I've came across this issue while writing report-like query for 2 not
very large tables. I've tried several methods to resolve this one (see
below). But now I'm really stuck...
PostgreSQL 8.3, default configuration
There are 2 tables (structure was simplified to show only problematic
place):
create table c
(
id bigint primary key
cdate date
);
create index c_cdate_idx on c (cdate);
create table i
(
id bigint primary key,
id_c bigint references c(id)
);
select count(*) from c
count
--------
636 565
select count(*) from i
count
--------
4 646 145
analyze i;
analyze c;
explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=738.95..57864.63 rows=14479 width=8) (actual
time=13954.681..14358.731 rows=14583
loops=1)
Merge Cond: (i.idc =
c.id)
-> Index Scan using fki_i_c_fk on i (cost=0.00..194324.34
rows=4646145 width=8) (actual time=17.254..12061.414 rows=1042599 loops=1)
-> Sort (cost=738.94..756.88 rows=7178 width=8) (actual
time=53.942..75.013 rows=14583
loops=1)
Sort Key:
c.id
Sort Method: quicksort Memory:
404kB
-> Index Scan using c_cdate_idx on c (cost=0.00..279.21
rows=7178 width=8) (actual time=23.595..41.470 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
Total runtime: 14379.461
ms
set enable_mergejoin to off;
set enable_hashjoin to off;
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..59833.70 rows=14479 width=8) (actual
time=0.129..153.038 rows=14583
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..279.21 rows=7178
width=8) (actual time=0.091..14.468 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)
Index Cond: (i.idc =
c.id)
Total runtime: 172.599 ms
Ok, the first problem is here:
-> Index Scan using fki_i_c_fk on i (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)
I collected statistics for these tables at level 1000 for all columns.
select attname, null_frac, avg_width, n_distinct, correlation
from pg_stats
where tablename = 'i'
attname null_frac avg_width n_distinct
correlation
---------- ------------------ ------------ -------------
------------------
id 0 8 -1 0,9999849796295166
idc 0,7236369848251343 8 95583 0,999763011932373
Nice stats except of n_distinct for idc column.
select count(distinct idc)
from i
count
--------
633 864
Of course it is not correct solution but...
update pg_statistic
set stadistinct = 633864
where starelid = ... and staattnum = ...
Reconnect and execute:
explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..57342.39 rows=14479 width=8) (actual
time=0.133..151.426 rows=14583
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..279.21 rows=7178
width=8) (actual time=0.094..14.242 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..7.92 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7064)
Index Cond: (i.idc =
c.id)
Total runtime: 170.911
ms
But the reason of this issue is not the incorrect value of n_distinct.
Let's expand dates interval in WHERE clause.
explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-19'
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=831.16..57981.98 rows=16155 width=8) (actual
time=11691.156..12155.201 rows=16357
loops=1)
Merge Cond: (i.idc =
c.id)
-> Index Scan using fki_i_c_fk on i (cost=0.00..194324.34
rows=4646145 width=8) (actual time=22.236..9928.489 rows=1044373 loops=1)
-> Sort (cost=831.15..851.17 rows=8009 width=8) (actual
time=31.660..55.277 rows=16357
loops=1)
Sort Key:
c.id
Sort Method: quicksort Memory:
438kB
-> Index Scan using c_cdate_idx on c (cost=0.00..311.87
rows=8009 width=8) (actual time=0.116..17.050 rows=7918 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
Total runtime: 12178.678 ms
set enable_mergejoin to off;
set enable_hashjoin to off;
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..63724.20 rows=16155 width=8) (actual
time=0.131..171.292 rows=16357
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..311.87 rows=8009
width=8) (actual time=0.093..15.906 rows=7918 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..7.89 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7918)
Index Cond: (i.idc =
c.id)
Total runtime: 193.221 ms
Why nested loop is overestimated here (63000 estimated = 171 actual,
58000 estimated = 12155 actual).
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2008-03-07 06:35:46 | Re: Nested loop vs merge join: inconsistencies between estimated and actual time |
Previous Message | Tom Lane | 2008-03-07 05:40:13 | Re: Improve Full text rank in a query |