From: | Josh Burdick <jburdick(at)gradient(dot)cis(dot)upenn(dot)edu> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | question about executing JOINs |
Date: | 2002-10-10 19:08:39 |
Message-ID: | 3DA5D037.5050000@gradient.cis.upenn.edu |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
We're using Postgres 7.2.1 in a biology lab. Some of our joins
involving selecting from two tables, but the fields we're selecting
aren't equal, they're just nearby.
select "tName" as "chrom",
[...stuff deleted...]
"tStart" - "snp_start" as "distance_from_start",
"snp_start" - "tEnd" as "distance_from_end"
from ucsc_ref_seq_ali_hg12 join ucsc_snp_tsc_hg12
on ucsc_snp_tsc_hg12."snp_chrom" = ucsc_ref_seq_ali_hg12."tName"
where "snp_start" >= "tStart" - 1000000
and "snp_start" <= "tEnd" + 1000000;
The problem is that the planner comes up with:
Merge Join (cost=201234.34..17463319.54 rows=85230539 width=51)
-> Sort (cost=1870.74..1870.74 rows=15165 width=29)
-> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65
rows=15165 width=29)
-> Sort (cost=199363.59..199363.59 rows=1145280 width=22)
-> Seq Scan on ucsc_snp_tsc_hg12 (cost=0.00..20996.80
rows=1145280 width=22)
which doesn't finish after 45 minutes, and presumably would take a
while to finish. However, by temporarily doing
set enable_mergejoin=off;
set enable_hashjoin=off;
the planner says
Nested Loop (cost=0.00..198183643.06 rows=85230539 width=51)
-> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65 rows=15165
width=29)
-> Index Scan using ucsc_snp_tsc_hg12_chrom_start on
ucsc_snp_tsc_hg12 (cost=0.00..12962.39 rows=4713 width=22)
which takes four minutes.
So, in practice, we just save the output of that in a table (using
CREATE TABLE AS), which works. It's just a bit awkward to have to
fiddle with those switches; it would be nice if the planner were a bit
cleverer in this case.
Is there a runtime parameter that seems likely to fix this?
(perhaps CPU_INDEX_TUPLE_COST).
The planner is assuming that it's a cross join: that we need all
pairs of records. It's not taking into account the WHERE clause which
restricts to a tiny fraction of the records. Perhaps the planner should
assume that a nested loop over an index scan only looks at 1% of its
records?
This would make cross joins more expensive. But most of my cross
joins have been accidental, when I've left out a join condition, so I
don't really mind them taking longer, because I'll just stop them after
20 minutes anyway :)
Note that this isn't a question of join order (I think); it's just a
question of which way to execute the join.
Thanks,
Josh
--
Josh Burdick
jburdick(at)gradient(dot)cis(dot)upenn(dot)edu
http://www.cis.upenn.edu/~jburdick
From | Date | Subject | |
---|---|---|---|
Next Message | Patrick Nelson | 2002-10-10 19:13:57 | RE in WHERE |
Previous Message | Magnus Naeslund(f) | 2002-10-10 18:15:05 | Re: connection watchdog |