From: | Manfred Koizar <mkoi-pg(at)aon(dot)at> |
---|---|
To: | Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us |
Subject: | Re: cost and actual time |
Date: | 2003-02-18 17:31:48 |
Message-ID: | ajo45v4o0op0f3qnqh1le2rs79008fqp5u@4ax.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>1. the old query, leaving out the table gene and setting
>gene_occurrences.gene_id to a certain gene_id, or the function
>get_gene_id, respectively. (This is the query you suggested, Manfred.)
This was Tom's suggestion. I might have ended up there in a day or
two :-)
>What takes up most of the runtime the Nested Loop (for the join of
>disease and disease_occurrences, or rather for joining both occurrences
>tables? I'm not sure which rows belong together in the explain output).
... for joining both occurrences: The "-> Nested Loop" takes two
tables (the "-> Index Scans") as input and produces one table as
output which is again used as input for the "-> Hash Join" above it.
>2. The new query, same changes (gene left out, subselect replaced with
>get_gene_id):
>
>EXPLAIN ANALYZE
> SELECT disease.disease_name, count(disease.disease_name) AS cnt
> FROM
> ((SELECT gene_occurrences.sentence_id
> FROM gene_occurrences
> WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
> JOIN disease_occurrences USING (sentence_id)) as tmp2
> NATURAL JOIN disease
>GROUP BY disease.disease_name
>ORDER BY cnt DESC;
There is no DISTINCT here. This is equvalent to your first query, iff
the following unique constraints are true:
(gene_id, sentence_id) in gene_occurrences
(disease_id, sentence_id) in disease_occurrences
(disease_id) in disease
If they are, you don't need a sub-select (unless I'm missing
something, please double-check):
EXPLAIN ANALYZE
SELECT disease.disease_name, count(*) AS cnt
FROM disease, gene_occurrences, disease_occurrences
WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
AND gene_occurrences.gene_id=get_gene_id('igm')
AND disease.disease_id=disease_occurrences.disease_id
GROUP BY tmp.disease_name
ORDER BY cnt DESC;
Anyway, your problem boils down to
EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
FROM gene_occurrences g, disease_occurrences d
WHERE g.sentence_id = d.sentence_id
AND g.gene_id = 'some constant value';
Play with enable_xxxx to find out which join method provides the best
performance for various gene_ids. Then we can start to fiddle with
run-time parameters to help the optimizer choose the right plan.
>Most of the runtime is used up by the index scan to join the occurrences
>tables [...]
>
>At the moment my settings concering the query planner are:
>
>effective_cache_size = 80000 # typically 8KB each, default 1000
>random_page_cost = 1.5 # units are one sequential page fetch cost
Usually you set a low random_page_cost value (the default is 4) if you
want to favour index scans where the optimizer tends to use sequential
scans. Was this your intention?
>cpu_tuple_cost = 0.01 # (same), default 0.01
>cpu_index_tuple_cost = 0.00001 # (same), default 0.001
>cpu_operator_cost = 0.005 # (same), default 0.0025
Just out of curiosity: Are these settings based on prior experience?
Servus
Manfred
From | Date | Subject | |
---|---|---|---|
Next Message | Josh Berkus | 2003-02-18 17:37:20 | Re: Questions about indexes? |
Previous Message | Chantal Ackermann | 2003-02-18 10:28:40 | Re: cost and actual time |