question about executing JOINs

From: Josh Burdick <jburdick(at)gradient(dot)cis(dot)upenn(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: question about executing JOINs
Date: 2002-10-10 19:08:39
Message-ID: 3DA5D037.5050000@gradient.cis.upenn.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We're using Postgres 7.2.1 in a biology lab. Some of our joins
involving selecting from two tables, but the fields we're selecting
aren't equal, they're just nearby.

select "tName" as "chrom",
[...stuff deleted...]
"tStart" - "snp_start" as "distance_from_start",
"snp_start" - "tEnd" as "distance_from_end"
from ucsc_ref_seq_ali_hg12 join ucsc_snp_tsc_hg12
on ucsc_snp_tsc_hg12."snp_chrom" = ucsc_ref_seq_ali_hg12."tName"
where "snp_start" >= "tStart" - 1000000
and "snp_start" <= "tEnd" + 1000000;

The problem is that the planner comes up with:

Merge Join (cost=201234.34..17463319.54 rows=85230539 width=51)
-> Sort (cost=1870.74..1870.74 rows=15165 width=29)
-> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65
rows=15165 width=29)
-> Sort (cost=199363.59..199363.59 rows=1145280 width=22)
-> Seq Scan on ucsc_snp_tsc_hg12 (cost=0.00..20996.80
rows=1145280 width=22)

which doesn't finish after 45 minutes, and presumably would take a
while to finish. However, by temporarily doing

set enable_mergejoin=off;
set enable_hashjoin=off;

the planner says

Nested Loop (cost=0.00..198183643.06 rows=85230539 width=51)
-> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65 rows=15165
width=29)
-> Index Scan using ucsc_snp_tsc_hg12_chrom_start on
ucsc_snp_tsc_hg12 (cost=0.00..12962.39 rows=4713 width=22)

which takes four minutes.
So, in practice, we just save the output of that in a table (using
CREATE TABLE AS), which works. It's just a bit awkward to have to
fiddle with those switches; it would be nice if the planner were a bit
cleverer in this case.
Is there a runtime parameter that seems likely to fix this?
(perhaps CPU_INDEX_TUPLE_COST).
The planner is assuming that it's a cross join: that we need all
pairs of records. It's not taking into account the WHERE clause which
restricts to a tiny fraction of the records. Perhaps the planner should
assume that a nested loop over an index scan only looks at 1% of its
records?
This would make cross joins more expensive. But most of my cross
joins have been accidental, when I've left out a join condition, so I
don't really mind them taking longer, because I'll just stop them after
20 minutes anyway :)
Note that this isn't a question of join order (I think); it's just a
question of which way to execute the join.
Thanks,
Josh

--
Josh Burdick
jburdick(at)gradient(dot)cis(dot)upenn(dot)edu
http://www.cis.upenn.edu/~jburdick

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Patrick Nelson 2002-10-10 19:13:57 RE in WHERE
Previous Message Magnus Naeslund(f) 2002-10-10 18:15:05 Re: connection watchdog