Nested loop vs merge join: inconsistencies between estimated and actual time

From: Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>
To: pgsql-performance(at)postgresql(dot)org
Subject: Nested loop vs merge join: inconsistencies between estimated and actual time
Date: 2008-03-07 05:50:40
Message-ID: 47D0D7B0.7060109@dc.baikal.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I've came across this issue while writing report-like query for 2 not
very large tables. I've tried several methods to resolve this one (see
below). But now I'm really stuck...
PostgreSQL 8.3, default configuration

There are 2 tables (structure was simplified to show only problematic
place):
create table c
(
id bigint primary key
cdate date
);

create index c_cdate_idx on c (cdate);

create table i
(
id bigint primary key,
id_c bigint references c(id)
);

select count(*) from c

count
--------
636 565

select count(*) from i

count
--------
4 646 145

analyze i;
analyze c;

explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------

Merge Join (cost=738.95..57864.63 rows=14479 width=8) (actual
time=13954.681..14358.731 rows=14583
loops=1)
Merge Cond: (i.idc =
c.id)

-> Index Scan using fki_i_c_fk on i (cost=0.00..194324.34
rows=4646145 width=8) (actual time=17.254..12061.414 rows=1042599 loops=1)
-> Sort (cost=738.94..756.88 rows=7178 width=8) (actual
time=53.942..75.013 rows=14583
loops=1)
Sort Key:
c.id

Sort Method: quicksort Memory:
404kB

-> Index Scan using c_cdate_idx on c (cost=0.00..279.21
rows=7178 width=8) (actual time=23.595..41.470 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
Total runtime: 14379.461
ms

set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.00..59833.70 rows=14479 width=8) (actual
time=0.129..153.038 rows=14583
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..279.21 rows=7178
width=8) (actual time=0.091..14.468 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)
Index Cond: (i.idc =
c.id)

Total runtime: 172.599 ms

Ok, the first problem is here:
-> Index Scan using fki_i_c_fk on i (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)

I collected statistics for these tables at level 1000 for all columns.

select attname, null_frac, avg_width, n_distinct, correlation
from pg_stats
where tablename = 'i'

attname null_frac avg_width n_distinct
correlation
---------- ------------------ ------------ -------------
------------------
id 0 8 -1 0,9999849796295166
idc 0,7236369848251343 8 95583 0,999763011932373

Nice stats except of n_distinct for idc column.

select count(distinct idc)
from i

count
--------
633 864

Of course it is not correct solution but...

update pg_statistic
set stadistinct = 633864
where starelid = ... and staattnum = ...

Reconnect and execute:

explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.00..57342.39 rows=14479 width=8) (actual
time=0.133..151.426 rows=14583
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..279.21 rows=7178
width=8) (actual time=0.094..14.242 rows=7064 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..7.92 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7064)
Index Cond: (i.idc =
c.id)

Total runtime: 170.911
ms

But the reason of this issue is not the incorrect value of n_distinct.
Let's expand dates interval in WHERE clause.

explain analyze
select id
from c
join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-19'

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------

Merge Join (cost=831.16..57981.98 rows=16155 width=8) (actual
time=11691.156..12155.201 rows=16357
loops=1)
Merge Cond: (i.idc =
c.id)

-> Index Scan using fki_i_c_fk on i (cost=0.00..194324.34
rows=4646145 width=8) (actual time=22.236..9928.489 rows=1044373 loops=1)
-> Sort (cost=831.15..851.17 rows=8009 width=8) (actual
time=31.660..55.277 rows=16357
loops=1)
Sort Key:
c.id

Sort Method: quicksort Memory:
438kB

-> Index Scan using c_cdate_idx on c (cost=0.00..311.87
rows=8009 width=8) (actual time=0.116..17.050 rows=7918 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
Total runtime: 12178.678 ms

set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.00..63724.20 rows=16155 width=8) (actual
time=0.131..171.292 rows=16357
loops=1)
-> Index Scan using c_cdate_idx on c (cost=0.00..311.87 rows=8009
width=8) (actual time=0.093..15.906 rows=7918 loops=1)
Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
-> Index Scan using fki_i_c_fk on i (cost=0.00..7.89 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7918)
Index Cond: (i.idc =
c.id)

Total runtime: 193.221 ms

Why nested loop is overestimated here (63000 estimated = 171 actual,
58000 estimated = 12155 actual).

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2008-03-07 06:35:46 Re: Nested loop vs merge join: inconsistencies between estimated and actual time
Previous Message Tom Lane 2008-03-07 05:40:13 Re: Improve Full text rank in a query