From: | Marti Raudsepp <marti(at)juffo(dot)org> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Bogus nestloop join estimate, ignores WHERE clause |
Date: | 2011-08-31 12:38:50 |
Message-ID: | CABRT9RAWwn0KQ66K=7cvTuxyTzEHd6P_PxYDkT3gfwg2xqybHg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi list!
I'm getting really surprising planner estimates for a query that's
joining another table via a varchar field. All of this was tested on
PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.
The original query is pretty huge, but I managed to shrink it down to this:
SELECT * FROM email_message where email_message.id_code IN (SELECT
id_code FROM client WHERE client_id='83509');
* id_code is an indexed varchar(20) NOT NULL column in both tables
* client_id is the primary key of client.
* There are 149152 rows in email_message and 140975 rows in client
* The most common value in both sides of the join is an empty string.
121970 in email_message and 10753 in client
(Turning the empty values into NULLs helps a little, but still gives
bad estimates)
This is the plan I get:
EXPLAIN SELECT * FROM email_message where email_message.id_code IN
(SELECT id_code FROM client WHERE client_id='83509');
Nested Loop (cost=8.28..36.86 rows=139542 width=101)
-> HashAggregate (cost=8.28..8.29 rows=1 width=11)
-> Index Scan using client_pkey on client (cost=0.00..8.28
rows=1 width=11)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)
(6 rows)
This nestloop couldn't possibly generate 139542 rows since the inner
plan is expected to return 1 row and the outer plan 41
After a bit of digging, I figured out that it uses the same estimate
as a semi-join WITHOUT the client_id restriction.
EXPLAIN SELECT * FROM email_message m WHERE EXISTS(SELECT * FROM
client c WHERE m.id_code=c.id_code);
Nested Loop Semi Join (cost=0.00..7725.31 rows=139542 width=101)
-> Seq Scan on email_message m (cost=0.00..3966.52 rows=149152 width=101)
-> Index Scan using client_id_code_idx1 on client c
(cost=0.00..0.39 rows=1 width=11)
Index Cond: ((c.id_code)::text = (m.id_code)::text)
For whatever reason, the 1st query completely ignores the fact that
the client_id clause reduces the result count by a large factor.
So I turned this into a simple JOIN and I'm still seeing bad estimates:
EXPLAIN SELECT * FROM email_message JOIN client USING (id_code) WHERE
client_id='83509';
Nested Loop (cost=0.00..36.85 rows=9396 width=252)
-> Index Scan using client_pkey on client (cost=0.00..8.28 rows=1
width=162)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)
This is better, but still overestimates massively.
When I change empty values to NULLs, then this JOIN query starts
estimating correctly. So this one is probably confused because the
empty values would result in a cartesian join.
Are there any reasons why nestloop can't use the known (1 * 41) as its estimate?
Regards,
Marti Raudsepp
voicecom.ee
From | Date | Subject | |
---|---|---|---|
Next Message | Heikki Linnakangas | 2011-08-31 13:05:31 | setlocale() on Windows is broken |
Previous Message | Bernd Helmle | 2011-08-31 12:13:11 | Informix FDW - anybody working on this? |