From: | <francisco(dot)santos(at)tagus(dot)ist(dot)utl(dot)pt> |
---|---|
To: | <pgsql-performance(at)postgresql(dot)org> |
Subject: | Convert IN sublink to join |
Date: | 2005-12-14 13:27:21 |
Message-ID: | 000001c600b2$1ccb44e0$017ba8c0@flintstone |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Dear Sir or Madam:
The function "convert_IN_to_join(Query *parse, SubLink *sublink)", from
file: <postgres-8.0.4 root>/src/backend/optimizer/plan/subselect.c, is
responsible for converting IN type sublinks to joins, whenever appropriate.
The following lines of code, extracted from convert_IN_to_join, verify if
the subquery is correlated:
/*
* The sub-select must not refer to any Vars of the parent query.
* (Vars of higher levels should be okay, though.)
*/
if (contain_vars_of_level((Node *) subselect, 1))
return NULL;
By commenting this code region I was able to optimize several correlated
subqueries. Apparently, the rest of the PostgreSQL code is ready to handle
the "convert subquery to join" optimization. Is this test really necessary?
Please analyze the following example:
DDL:
CREATE TABLE "students" ( sid char(5) primary key, name char(20) not null,
age integer, email char(20) not null, unique( email ), avgrade float not
null );
CREATE TABLE "enrolled" ( sid char(5), cid char(5), grade real not null,
primary key(sid,cid), foreign key(sid) references students (sid) on delete
restrict );
DML:
1) correlated IN subquery:
"Find all students who's grade in 'TFC' class is higher than their average
grade."
select a.sid, a.name from students a where a.sid IN ( select i.sid from
enrolled i where i.grade > a.avgrade AND i.cid = 'TFC');
QUERY PLAN:
Seq Scan on students a (cost=0.00..5763804.50 rows=5000 width=33)
Filter: (subplan)
SubPlan
-> Seq Scan on enrolled i (cost=0.00..1144.00 rows=3473 width=9)
Filter: ((grade > $0) AND (cid = 'TFC'::bpchar))
2) the same query after commenting out the above code region in
convert_IN_to_join:
QUERY PLAN:
Hash Join (cost=1050.24..1518.21 rows=693 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("inner".grade > "outer".avgrade)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
-> Hash (cost=1045.04..1045.04 rows=2078 width=13)
-> HashAggregate (cost=1045.04..1045.04 rows=2078 width=13)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417
width=13)
Filter: (cid = 'TFC'::bpchar)
3) Clearly, it is possible to extract the IN subquery from query 1 since the
outer attribute a.sid matches, at most once, with the inner tuple i.sid.
Although s.sid is not a primary key by itself, together with "i.cid = 'TFC'"
conjunct, it forms a unique tuple. Here is an efficient alternative to query
1:
select a.sid, a.name from students a, enrolled i where a.sid = i.sid AND
i.cid = 'TFC' AND i.grade > a.avgrade;
QUERY PLAN:
Hash Join (cost=480.00..2366.86 rows=3473 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("outer".grade > "inner".avgrade)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
Filter: (cid = 'TFC'::bpchar)
-> Hash (cost=367.00..367.00 rows=10000 width=41)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
I have verified that both 2) and 3) return the exact same tuples, query 1)
never completed due to the highly inefficient execution plan.
Please help me with this issue.
Kind regards,
Francisco Santos
From | Date | Subject | |
---|---|---|---|
Next Message | John Sidney-Woollett | 2005-12-14 13:52:00 | Re: Memory Leakage Problem |
Previous Message | Martijn van Oosterhout | 2005-12-14 09:21:41 | Re: Memory Leakage Problem |