Re: experiments in query optimization

From: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
To: Eliot Gable <egable+pgsql-performance(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: experiments in query optimization
Date: 2010-04-01 18:15:13
Message-ID: alpine.DEB.2.00.1004012230520.17251@orwell.homelinux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Hi Eliot,

Thanks for the comment.

On Thu, 1 Apr 2010, Eliot Gable wrote:

> On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:

> Looking at this more closely, idlink_id and anno_id are primary keys, so
> already have indexes on them, so my understanding (from the docs) is
> there is no purpose in creating them. That's why I removed the indexes
> that were there (back last August, actually, according to my logs).
> Anyway, doesn't look there is anything I can do here. Does anyone have
> additions or corrections to this?

> When you do a join, you typically have a foreign key in one table
> referencing a primary key in another table. While designating a foreign
> key does put a constraint on the key to ensure referential integrity, it
> does not put an index on the column that is being designated as a
> foreign key. If I understand correctly, the scan done as the inner loop
> of the nested loop scan for the join is going to be your foreign key
> column, not your primary key column. Thus, if you have no index on the
> foreign key column, you will be forced to do a sequential table scan to
> do the join. In that case the hash-based join will almost certainly be
> faster (especially for such a large number of rows). If you put an index
> on the foreign key, then the inner scan can be an index scan and that
> might turn out to be faster than building the hash indexes on all the
> table rows.

> Somebody can correct me if I'm wrong.

I had set the foreign keys in question (on the geno table) to be primary
keys. This is because this setup is basically a glorified spreadsheet, and
I don't want more than one cell corresponding to a particular tuple of
idlink.id and anno.id (the conceptual rows and cols). Since a primary key
defines an index, I thought putting indexes on idlink_id and anno_id was
redundant. However, it looks like (unsurprisingly) the index corresponding
to the primary key is across both columns, which may not be what is wanted
for the aforesaid join. Ie.

ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)

(As a side comment, with respect to the indexes on the other side of the
joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a
primary key too. In the other, namely geno.anno_id =
dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on
dedup_patient_anno.id. But maybe indexes aren't needed there.)

Here is the join

SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
geno.idlink_id, geno.anno_id
FROM geno
INNER JOIN dedup_patient_anno
ON geno.anno_id = dedup_patient_anno.id
INNER JOIN idlink
ON geno.idlink_id = idlink.id
ORDER BY idlink_id, anno_id

Here is the table dump.

****************************************************************
-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
idlink_id integer NOT NULL,
anno_id integer NOT NULL,
snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);

ALTER TABLE hapmap.geno OWNER TO snp;
--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp;
Tablespace:
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id); (!!!!)
--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES
anno(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES
idlink(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES
snpval(val) ON UPDATE CASCADE ON DELETE CASCADE;
*************************************************************************

So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?

Regards, Faheem.

> --
> Eliot Gable
>
> "We do not inherit the Earth from our ancestors: we borrow it from our
> children." ~David Brower

> "I decided the words were too conservative for me. We're not borrowing
> from our children, we're stealing from them--and it's not even
> considered to be a crime." ~David Brower

Nice quotes.

> "Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live;
> not live to eat.) ~Marcus Tullius Cicero

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2010-04-01 18:50:59 Re: experiments in query optimization
Previous Message Eliot Gable 2010-04-01 16:31:02 Re: experiments in query optimization