query optimization differs between view and explicit query

From: Reece Hart <reece(at)in-machina(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: query optimization differs between view and explicit query
Date: 2004-01-30 00:31:39
Message-ID: 1075422699.4062.164.camel@tallac
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have a large query which I would like to place in a view. The explicit
query is sufficiently fast, but the same query as a view is much slower
and uses a different plan. I would appreciate an explanation of why this
is, and, more importantly whether/how I might coax the view to use a
different plan.

The original query:

rkh(at)csb-dev=> select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
where H.pseq_id=76
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len
\g /dev/null
Time: 277.804 ms

Now as a view:

rkh(at)csb-dev=> create view v1 as
select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len;
CREATE VIEW
Time: 103.041 ms

rkh(at)csb-dev=> select * from v1 where pseq_id=76 \g /dev/null
Time: 31973.979 ms

Okay, that's ~100x slower. The plans:

rkh(at)csb-dev=> explain select distinct on <snip... same as the first query above>
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=11157.75..11187.26 rows=454 width=40)
-> GroupAggregate (cost=11157.75..11186.13 rows=454 width=40)
-> Sort (cost=11157.75..11158.89 rows=454 width=40)
Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
-> Nested Loop (cost=11125.62..11137.71 rows=454 width=40)
-> Index Scan using pseq_pkey on pseq q (cost=0.00..3.01 rows=2 width=6)
Index Cond: (76 = pseq_id)
-> Materialize (cost=11125.62..11127.89 rows=227 width=38)
-> Nested Loop (cost=546.15..11125.62 rows=227 width=38)
-> Hash Join (cost=546.15..10438.72 rows=227 width=34)
Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
-> Seq Scan on p2gblatalnhsp ah (cost=0.00..6504.03 rows=451503 width=8)
-> Hash (cost=545.58..545.58 rows=227 width=34)
-> Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..545.58 rows=227 wid Index Cond: (pseq_id = 76)
-> Index Scan using p2gblataln_pkey on p2gblataln a (cost=0.00..3.01 rows=1 width=8)
Index Cond: ("outer".p2gblataln_id = a.p2gblataln_id)
(17 rows)

rkh(at)csb-dev=> explain select * from v1 where pseq_id=76;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------- Subquery Scan v1 (cost=246907.54..281897.70 rows=2258 width=77)
Filter: (pseq_id = 76)
-> Unique (cost=246907.54..276254.13 rows=451486 width=40)
-> GroupAggregate (cost=246907.54..275125.41 rows=451486 width=40)
-> Sort (cost=246907.54..248036.25 rows=451486 width=40)
Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
-> Hash Join (cost=14019.29..204503.24 rows=451486 width=40)
Hash Cond: ("outer".p2gblataln_id = "inner".p2gblataln_id)
-> Hash Join (cost=7632.79..191344.45 rows=451486 width=36)
Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
-> Merge Join (cost=0.00..176939.38 rows=451486 width=36)
Merge Cond: ("outer".pseq_id = "inner".pseq_id)
-> Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..16102.40 rows=451485 widt -> Index Scan using pseq_pkey on pseq q (cost=0.00..159942.83 rows=1960257 width=6)
-> Hash (cost=6504.03..6504.03 rows=451503 width=8)
-> Seq Scan on p2gblatalnhsp ah (cost=0.00..6504.03 rows=451503 width=8)
-> Hash (cost=5587.00..5587.00 rows=319800 width=8)
-> Seq Scan on p2gblataln a (cost=0.00..5587.00 rows=319800 width=8)
(18 rows)

Obviously, the problem is that the pseq_id criterion needs to be pushed
farther down into the view plan (as it was for the explicit query).
Since I've typically had comparable performance between views and
explicit queries, I'm curious to know what aspect of this query prevents
better optimization. My understanding from the 7.4.1 HISTORY is that
ANSI-style joins (with JOIN) are now optimized as efficiently as
WHERE-style joins... am I wrong?

Thanks for your time.

Reece

--
Reece Hart, http://www.in-machina.com/~reece/, GPG:0x25EC91A0

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Christopher Kings-Lynne 2004-01-30 00:52:15 Re: query optimization differs between view and explicit
Previous Message Josh Berkus 2004-01-29 23:17:19 Re: [PERFORM] Set-Returning Functions WAS: On the performance of views