Re: Joining two large tables on a tiny subset of rows

From: Richard Huxton <dev(at)archonet(dot)com>
To: Dmitri Bichko <dbichko(at)aveopharma(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Joining two large tables on a tiny subset of rows
Date: 2005-07-27 07:06:20
Message-ID: 42E7326C.5060403@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Dmitri Bichko wrote:
> Hello,
>
> I have two tables, one has a foreing key from the other (only showing
> the relevant columns and indices here):
>
> Table "expresso.probes"
> Column | Type | Modifiers
> -------------+------------------------+-----------
> platform_id | integer | not null
> probe_num | integer | not null
> mrna_acc | character varying(50) |
> Indexes:
> "idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
> IS NOT NULL
>
> Table "expresso.mrna_info"
> Column | Type | Modifiers
> -------------+-----------------------+-----------
> mrna_acc | character varying(25) | not null
> symbol | character varying(50) | not null
> Indexes:
> "idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
> NOT NULL

1. The two types of mrna_acc don't match - one has a max length of 25,
one 50. Why?
2. With idx_probes_mrna_acc, why WHERE mrna_acc IS NOT NULL? NULLs
aren't indexed anyway.
3. You say there is a foreign key, but I don't even see a primary key
anywhere. I'm guessing mrna_info.mrna_acc is the primary key for that table.

> Here are the explains for the two step process:
>
> expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
> upper(symbol) = upper('pparg') AND symbol IS NOT NULL;

As someone else mentions, the IS NOT NULL is redundant.

> Index Scan using idx_mrna_info_symbol on mrna_info (cost=0.00..2934.78
> rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)

Note that the estimated number of rows is wrong though (930 rather than
the actual 12).

> EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
> IN
> ('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
> ,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
> NOT NULL;

Again, a redundant IS NOT NULL, which presumably you're putting in to
use the index.

> Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
> idx_probes_mrna_acc on probes (cost=0.00..14710.63 rows=4151 width=8)
> (actual time=0.040..0.719 rows=142 loops=1)

Again, it's getting the row estimate badly wrong (4151 vs 142).

> And here is the explain of the join (it's essentially the same plan as
> the subselect and all the other ways I've tried):
>
> expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
> probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
> UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;

Well, firstly get rid of the redundant "IS NOT NULL"s in the query and
the indexes, then vacuum analyse the tables and post an EXPLAIN ANALYSE.

The problem will probably turn out to be poor row estimates (you can
increase the statistics gathered on the mrna_acc values) or poor
configuration settings (making indexes look expensive compared to
sequential scans).

--
Richard Huxton
Archonet Ltd

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Chris Browne 2005-07-27 15:31:00 Re: REINDEX DATABASE
Previous Message Achilleus Mantzios 2005-07-27 06:08:30 Re: REINDEX DATABASE