self join, parameterized base/join rel path row estimation and generally...

From: Bongseo Jang <graycells(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: self join, parameterized base/join rel path row estimation and generally...
Date: 2014-01-27 05:58:41
Message-ID: CAKPsnf1+ZU0W=UYPB_eAi3u8qATx43tny5Hba4x_B_T71qUa6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I have two plans of a query.
nestloop plan is much faster, but planner chose the slower one, hashjoin.

http://explain.depesz.com/s/Aqs
http://explain.depesz.com/s/97C

it seems that rows=39698995 are quite overestimated.

-> Nested Loop (cost=0.000..5403.600 rows=39698995 width=45) (actual
time=0.392..14.817 rows=943 loops=1)
-> Nested Loop (cost=0.000..17.600 rows=1 width=8) (actual
time=0.241..0.246 rows=1 loops=1)
-> Index Scan using seven on hotel three (cost=0.000..6.880 rows=1
width=6) (actual time=0.113..0.115 rows=1 loops=1)
Index Cond: (two = 31750::numeric)
-> Index Scan using echo on oscar_foxtrot november
(cost=0.000..10.710 rows=1 width=14) (actual time=0.117..0.118 rows=1
loops=1)
Index Cond: (charlie = three.golf)
-> Index Scan using zulu on oscar_foxtrot juliet (cost=0.000..3849.200
rows=153679 width=45) (actual time=0.147..14.241 rows=943 loops=1)
Index Cond: ((uniform_yankee = november.uniform_yankee) AND
(uniform_victor = november.uniform_victor))

pg_stats is like this;
> select attname, null_frac, n_distinct, most_common_vals,
most_common_freqs from pg_stats where tablename like 'oscar_foxtrot%' and
(attname = 'uniform_yankee' or attname = 'uniform_victor')
"uniform_yankee";0;12;"{83886082,83886085}";"{0.9742,0.02}"
"uniform_victor";0;23;"{1342767106,1342308357}";"{0.973467,0.02}"

I assumed that nestloop rows would be more or less inner_path_rows *
outer_path_rows with good pg_stats, and good plan could come based on it.

the plan above is not that case. Suspcious of 40 million rows and small
number of values(actually two values) making up 98% of distribution.
so.. I looked up some code and found that rows=153679 is rows of
parameterized base rel estimated by eqsel(), and row=39698995 is rows of
parameterized join rel by eqjoinsel().
I think wrong plan above comes from the fact that the two estimation cannot
be close in general, great difference in my case.

where am i wrong and right?
Is there recommended approach, related issue, commit and so on i can follow
?

thanks

[[nstallation Info]]
PostgreSQL-9.2.5 (via postgresql yum repository)
OS: Centos 6.3 (custom linux-3.10.12 kernel)
postgresql.conf:
effective_cache_size = 10000MB
shared_buffers = 1000MB
work_mem = 100MB
maintenance_work_mem = 100MB
HW: CPU 4-core Xeon x 2 sockets, RAM 256GB

--
Regards,
Jang.

a sound mind in a sound body

Browse pgsql-performance by date

  From Date Subject
Next Message Ying He 2014-01-27 14:15:06 Re: pg_repack solves alter table set tablespace lock
Previous Message RAMAKRISHNAN KANDASAMY 2014-01-25 06:32:59 PostgreSQL 9.3.2 Performance tuning for 32 GB server