Convert IN sublink to join

From: <francisco(dot)santos(at)tagus(dot)ist(dot)utl(dot)pt>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Convert IN sublink to join
Date: 2005-12-14 13:27:21
Message-ID: 000001c600b2$1ccb44e0$017ba8c0@flintstone
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Dear Sir or Madam:

The function "convert_IN_to_join(Query *parse, SubLink *sublink)", from
file: <postgres-8.0.4 root>/src/backend/optimizer/plan/subselect.c, is
responsible for converting IN type sublinks to joins, whenever appropriate.

The following lines of code, extracted from convert_IN_to_join, verify if
the subquery is correlated:

/*
* The sub-select must not refer to any Vars of the parent query.
* (Vars of higher levels should be okay, though.)
*/
if (contain_vars_of_level((Node *) subselect, 1))
return NULL;

By commenting this code region I was able to optimize several correlated
subqueries. Apparently, the rest of the PostgreSQL code is ready to handle
the "convert subquery to join" optimization. Is this test really necessary?

Please analyze the following example:

DDL:
CREATE TABLE "students" ( sid char(5) primary key, name char(20) not null,
age integer, email char(20) not null, unique( email ), avgrade float not
null );
CREATE TABLE "enrolled" ( sid char(5), cid char(5), grade real not null,
primary key(sid,cid), foreign key(sid) references students (sid) on delete
restrict );

DML:

1) correlated IN subquery:
"Find all students who's grade in 'TFC' class is higher than their average
grade."
select a.sid, a.name from students a where a.sid IN ( select i.sid from
enrolled i where i.grade > a.avgrade AND i.cid = 'TFC');

QUERY PLAN:

Seq Scan on students a (cost=0.00..5763804.50 rows=5000 width=33)
Filter: (subplan)
SubPlan
-> Seq Scan on enrolled i (cost=0.00..1144.00 rows=3473 width=9)
Filter: ((grade > $0) AND (cid = 'TFC'::bpchar))

2) the same query after commenting out the above code region in
convert_IN_to_join:

QUERY PLAN:

Hash Join (cost=1050.24..1518.21 rows=693 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("inner".grade > "outer".avgrade)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
-> Hash (cost=1045.04..1045.04 rows=2078 width=13)
-> HashAggregate (cost=1045.04..1045.04 rows=2078 width=13)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417
width=13)
Filter: (cid = 'TFC'::bpchar)

3) Clearly, it is possible to extract the IN subquery from query 1 since the
outer attribute a.sid matches, at most once, with the inner tuple i.sid.
Although s.sid is not a primary key by itself, together with "i.cid = 'TFC'"
conjunct, it forms a unique tuple. Here is an efficient alternative to query
1:

select a.sid, a.name from students a, enrolled i where a.sid = i.sid AND
i.cid = 'TFC' AND i.grade > a.avgrade;

QUERY PLAN:

Hash Join (cost=480.00..2366.86 rows=3473 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("outer".grade > "inner".avgrade)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
Filter: (cid = 'TFC'::bpchar)
-> Hash (cost=367.00..367.00 rows=10000 width=41)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)

I have verified that both 2) and 3) return the exact same tuples, query 1)
never completed due to the highly inefficient execution plan.
Please help me with this issue.

Kind regards,
Francisco Santos

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message John Sidney-Woollett 2005-12-14 13:52:00 Re: Memory Leakage Problem
Previous Message Martijn van Oosterhout 2005-12-14 09:21:41 Re: Memory Leakage Problem