From: John Major <major(at)cbio(dot)mskcc(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve speed of 3 table join & group (HUGE tables)
Date: 2007-10-18 19:46:19
Message-ID: 4717B80B.6050706@cbio.mskcc.org
Lists: pgsql-performance
Hello Nis-
I did reset the defaults before running the explain.
Primary keys for the tables:
sequence_fragment.seq_frag_id
sequence.sequence_id
Candidate keys:
fragment_external_info.sequence_frag_id (FK to sequence_fragment.seq_frag_id)
sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id)
None of the fields are nullable.
sequence_fragment is the anchor table.
seq_frag_id is its primary key (and the foreign key target for
fragment_external_info): ~4.5 million unique entries.
sequence_id is an indexed field: ~3 million distinct IDs.
sequence_alignment has 500 million entries, but I join on sequence_id,
which has only ~3 million distinct values.
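For anyone following along, here is a rough sketch of the schema as described above, built in SQLite via Python's sqlite3 so it is runnable. Column types, index names, and the NOT NULL constraints are guesses; the real tables live in PostgreSQL:

```python
import sqlite3

# Hypothetical stand-in for the schema described in this thread.
# Types and index names are assumptions, not the real DDL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sequence_fragment (
    seq_frag_id INTEGER PRIMARY KEY,   -- ~4.5 million unique entries
    sequence_id INTEGER NOT NULL,      -- ~3 million distinct IDs, indexed
    library_id  INTEGER NOT NULL
);
CREATE INDEX idx_sf_sequence_id ON sequence_fragment (sequence_id);

CREATE TABLE fragment_external_info (
    sequence_frag_id INTEGER NOT NULL
        REFERENCES sequence_fragment (seq_frag_id),
    clip_type        TEXT NOT NULL
);
CREATE INDEX idx_fio_sequence_frag_id
    ON fragment_external_info (sequence_frag_id);

CREATE TABLE sequence_alignment (
    sequence_id INTEGER NOT NULL       -- 500M rows, ~3M distinct IDs
);
CREATE INDEX idx_sa_sequence_id ON sequence_alignment (sequence_id);
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```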
When I make the suggested changes, the new query is:
select sf.library_id, fio.clip_type, count(distinct sf.sequence_id)
from sequence_fragment sf, fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)
group by sf.library_id, fio.clip_type
After making the 2 changes, the cost dropped dramatically... but is
still very high.
Original Explain cost:
cost=1308049564..1345206293 rows=54 width=16
New Explain cost:
cost=11831119..11831120 rows=54 width=16
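To sanity-check that the rewrite preserves the results, here is a minimal sketch on toy data (SQLite via Python's sqlite3; the tiny tables and values are hypothetical stand-ins for the real millions of rows). It runs the original three-way join with count(distinct sa.sequence_id) and Nis's IN-subquery form with count(distinct sf.sequence_id), and checks they agree:

```python
import sqlite3

# Toy stand-ins for the three tables; column names follow this thread.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sequence_fragment (seq_frag_id INTEGER PRIMARY KEY,
                                sequence_id INTEGER, library_id INTEGER);
CREATE TABLE fragment_external_info (sequence_frag_id INTEGER,
                                     clip_type TEXT);
CREATE TABLE sequence_alignment (sequence_id INTEGER);
INSERT INTO sequence_fragment VALUES (1, 10, 1), (2, 10, 1), (3, 20, 2);
INSERT INTO fragment_external_info VALUES (1, 'a'), (2, 'a'), (3, 'b');
INSERT INTO sequence_alignment VALUES (10), (10), (10), (20);
""")

# Original form: three-way join, counting distinct sa.sequence_id.
original = con.execute("""
    SELECT sf.library_id, fio.clip_type, COUNT(DISTINCT sa.sequence_id)
    FROM sequence_alignment sa, sequence_fragment sf,
         fragment_external_info fio
    WHERE sf.seq_frag_id = fio.sequence_frag_id
      AND sf.sequence_id = sa.sequence_id
    GROUP BY sf.library_id, fio.clip_type
""").fetchall()

# Rewritten form: semi-join via IN, counting from sf instead of sa,
# so the 500M-row table only has to answer "does this id exist?".
rewritten = con.execute("""
    SELECT sf.library_id, fio.clip_type, COUNT(DISTINCT sf.sequence_id)
    FROM sequence_fragment sf, fragment_external_info fio
    WHERE sf.seq_frag_id = fio.sequence_frag_id
      AND sf.sequence_id IN (SELECT sequence_id FROM sequence_alignment)
    GROUP BY sf.library_id, fio.clip_type
""").fetchall()

assert sorted(original) == sorted(rewritten)
print(sorted(original))
```

Note that the DISTINCT in the rewritten form matters: sequence_fragment has ~4.5 million rows over ~3 million sequence_ids, so a plain count(sf.sequence_id) would count duplicate fragments per sequence.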
John
Nis Jørgensen wrote:
> John Major skrev:
>
>> I am trying to join three quite large tables, and the query is
>> unbearably slow (meaning I can't get results in more than a day of
>> processing).
>> I've tried the basic optimizations I understand, and nothing has
>> improved the execution speed... any help with this would be greatly
>> appreciated.
>>
>>
>> The three tables are quite large:
>> sequence_fragment = 4.5 million rows
>> fragment_external_info = 10 million rows
>> sequence_alignment = 500 million rows
>>
>>
>> The query I am attempting to run is this:
>>
>> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
>> from sequence_alignment sa, sequence_fragment sf,
>> fragment_external_info fio
>> where sf.seq_frag_id = fio.sequence_frag_id
>> and sf.sequence_id = sa.sequence_id
>> group by sf.library_id, fio.clip_type
>>
>>
>> NOTES:
>> ~there are indexes on all of the fields being joined (but not on
>> library_id or clip_type)
>> ~Everything has been re-analyzed post index creation
>>
>
> What are the primary (and candidate) keys of the tables? Are any of the
> fields nullable? How many distinct values exist for
> sequence_alignment.sequence_id?
>
>
>> ~I've tried "set enable_seqscan=off" and set (join_table_order or
>> something) = 1
>>
>
>
> It would help if you turned the settings back to defaults before doing
> the ANALYZE - or provide the results of that case as well.
>
>
>> The explain plan is as follows:
>>
>
> [cut]
>
> Without trying to understand the ANALYZE output, I would suggest two
> possible optimizations:
>
> - Using count(distinct(sf.sequence_id)) instead of
> count(distinct(sa.sequence_id)).
>
> - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
> (SELECT sequence_id from sequence_alignment)".
>
> The first one probably won't help (nor hurt), but the second one might
> be able to get rid of the table scan, or at least the need to do the full
> merge join (which returns an estimated 3 billion rows).
>
> Hope this helps,
>
> Nis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>