Re: Joining two large tables on a tiny subset of rows

From: Gnanavel S <s(dot)gnanavel(at)gmail(dot)com>
To: Dmitri Bichko <dbichko(at)aveopharma(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Joining two large tables on a tiny subset of rows
Date: 2005-07-27 03:48:41
Message-ID: eec3b03c050726204813908f41@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 7/27/05, Dmitri Bichko <dbichko(at)aveopharma(dot)com> wrote:
>
> Hello,
>
> I have two tables, one has a foreing key from the other (only showing
> the relevant columns and indices here):
>
> Table "expresso.probes"
> Column | Type | Modifiers
> -------------+------------------------+-----------
> platform_id | integer | not null
> probe_num | integer | not null
> mrna_acc | character varying(50) |
> Indexes:
> "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
> IS NOT NULL
>
> Table "expresso.mrna_info"
> Column | Type | Modifiers
> -------------+-----------------------+-----------
> mrna_acc | character varying(25) | not null
> symbol | character varying(50) | not null
> Indexes:
> "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
> NOT NULL
>
> I want to select all the records in "probes" which have an "mrna_acc"
> that is associated with a specific symbol in "mrna_info".
>
> This works fine if I do it in two steps: select from "mrna_info" (12
> records out of ~250 thousand) and paste those into an IN clause to
> select from "probes" (83 records out of ~4.5 million).
>
> But if I try to do this as one query I always end up with a seq scan on
> "probes". I've tried doing it with joins, subselects, joining with a
> dynamic table (select from () as foo), nothing seems to get the desired
> result.
>
> Here are the explains for the two step process:
>
> expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
> upper(symbol) = upper('pparg') AND symbol IS NOT NULL;
> QUERY PLAN
> ------------------------------------------------------------------------
> ---------------------------------------------------------------
> Index Scan using idx_mrna_info_symbol on mrna_info (cost=0.00..2934.78
> rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)
> Index Cond: (upper((symbol)::text) = 'PPARG'::text)
> Filter: (symbol IS NOT NULL)
> Total runtime: 0.123 ms
> (4 rows)

Whether any records are returned by this query. Moreover "symbol is not
null" condition is not needed as that column cannot be null by table
definition.

EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
> IN
> ('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
> ,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
> NOT NULL;
>
> QUERY PLAN
>
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> --------------------------------------------------------------------
> Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc on probes (cost=0.00..14710.63 rows=4151 width=8)
> (actual time=0.040..0.719 rows=142 loops=1)
> Index Cond: (((mrna_acc)::text = 'U10374'::text) OR ((mrna_acc)::text
> = 'U09138'::text) OR ((mrna_acc)::text = 'U01841'::text) OR
> ((mrna_acc)::text = 'U01664'::text) OR ((mrna_acc)::text =
> 'NM_015869'::text) OR ((mrna_acc)::text = 'NM_013124'::text) OR
> ((mrna_acc)::text = 'NM_011146'::text) OR ((mrna_acc)::text =
> 'NM_005037'::text) OR ((mrna_acc)::text = 'D83233'::text) OR
> ((mrna_acc)::text = 'BC021798'::text) OR ((mrna_acc)::text =
> 'BC006811'::text) OR ((mrna_acc)::text = 'AB011365'::text))
> Filter: (mrna_acc IS NOT NULL)
> Total runtime: 0.844 ms
> (4 rows)
>
> And here is the explain of the join (it's essentially the same plan as
> the subselect and all the other ways I've tried):
>
> expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
> probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
> UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;

Here m.mrna_acc cannot be null, so condition on p.mrna_acc is not null is
useless as m.mrna_acc = p.mrna_acc (for null) will not give any records
Try to use outer join.

QUERY PLAN
> ------------------------------------------------------------------------
> ----------------------------------
> Hash Join (cost=2937.10..133392.04 rows=9939 width=8)
> Hash Cond: (("outer".mrna_acc)::text = ("inner".mrna_acc)::text)
> -> Seq Scan on probes p (cost=0.00..115434.22 rows=2984265
> width=19)
> Filter: (mrna_acc IS NOT NULL)
> -> Hash (cost=2934.78..2934.78 rows=930 width=12)
> -> Index Scan using idx_mrna_info_symbol on mrna_info m
> (cost=0.00..2934.78 rows=930 width=12)
> Index Cond: (upper((symbol)::text) = 'PPARG'::text)
> Filter: (symbol IS NOT NULL)
> (8 rows)
>
> What am I doing wrong?
>
> Dmitri
>
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
with regards,
S.Gnanavel
Satyam Computer Services Ltd.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Achilleus Mantzios 2005-07-27 06:08:30 Re: REINDEX DATABASE
Previous Message Dmitri Bichko 2005-07-26 19:04:21 Joining two large tables on a tiny subset of rows