Joining two large tables on a tiny subset of rows

From: "Dmitri Bichko" <dbichko(at)aveopharma(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Joining two large tables on a tiny subset of rows
Date: 2005-07-26 19:04:21
Message-ID: F18A6F7CF1661F46920F2CF713122FED46CC35@mail.aveo.aveopharma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hello,

I have two tables, one has a foreing key from the other (only showing
the relevant columns and indices here):

Table "expresso.probes"
Column | Type | Modifiers
-------------+------------------------+-----------
platform_id | integer | not null
probe_num | integer | not null
mrna_acc | character varying(50) |
Indexes:
"idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
IS NOT NULL

Table "expresso.mrna_info"
Column | Type | Modifiers
-------------+-----------------------+-----------
mrna_acc | character varying(25) | not null
symbol | character varying(50) | not null
Indexes:
"idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
NOT NULL

I want to select all the records in "probes" which have an "mrna_acc"
that is associated with a specific symbol in "mrna_info".

This works fine if I do it in two steps: select from "mrna_info" (12
records out of ~250 thousand) and paste those into an IN clause to
select from "probes" (83 records out of ~4.5 million).

But if I try to do this as one query I always end up with a seq scan on
"probes". I've tried doing it with joins, subselects, joining with a
dynamic table (select from () as foo), nothing seems to get the desired
result.

Here are the explains for the two step process:

expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
upper(symbol) = upper('pparg') AND symbol IS NOT NULL;
QUERY PLAN
------------------------------------------------------------------------
---------------------------------------------------------------
Index Scan using idx_mrna_info_symbol on mrna_info (cost=0.00..2934.78
rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)
Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter: (symbol IS NOT NULL)
Total runtime: 0.123 ms
(4 rows)

EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
IN
('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
NOT NULL;

QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------------------------------
Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc on probes (cost=0.00..14710.63 rows=4151 width=8)
(actual time=0.040..0.719 rows=142 loops=1)
Index Cond: (((mrna_acc)::text = 'U10374'::text) OR ((mrna_acc)::text
= 'U09138'::text) OR ((mrna_acc)::text = 'U01841'::text) OR
((mrna_acc)::text = 'U01664'::text) OR ((mrna_acc)::text =
'NM_015869'::text) OR ((mrna_acc)::text = 'NM_013124'::text) OR
((mrna_acc)::text = 'NM_011146'::text) OR ((mrna_acc)::text =
'NM_005037'::text) OR ((mrna_acc)::text = 'D83233'::text) OR
((mrna_acc)::text = 'BC021798'::text) OR ((mrna_acc)::text =
'BC006811'::text) OR ((mrna_acc)::text = 'AB011365'::text))
Filter: (mrna_acc IS NOT NULL)
Total runtime: 0.844 ms
(4 rows)

And here is the explain of the join (it's essentially the same plan as
the subselect and all the other ways I've tried):

expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;
QUERY PLAN
------------------------------------------------------------------------
----------------------------------
Hash Join (cost=2937.10..133392.04 rows=9939 width=8)
Hash Cond: (("outer".mrna_acc)::text = ("inner".mrna_acc)::text)
-> Seq Scan on probes p (cost=0.00..115434.22 rows=2984265
width=19)
Filter: (mrna_acc IS NOT NULL)
-> Hash (cost=2934.78..2934.78 rows=930 width=12)
-> Index Scan using idx_mrna_info_symbol on mrna_info m
(cost=0.00..2934.78 rows=930 width=12)
Index Cond: (upper((symbol)::text) = 'PPARG'::text)
Filter: (symbol IS NOT NULL)
(8 rows)

What am I doing wrong?

Dmitri

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Gnanavel S 2005-07-27 03:48:41 Re: Joining two large tables on a tiny subset of rows
Previous Message Chris Browne 2005-07-26 17:14:42 Re: REINDEX DATABASE