Query with "like" is really slow

From: Christian Schröder <cs(at)deriva(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Query with "like" is really slow
Date: 2007-09-07 11:46:23
Message-ID: 46E13A0F.9080603@deriva.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi list,
if you please have a look at the following query:

SELECT DISTINCT a FROM table1
INNER JOIN table2 USING (b)
INNER JOIN view1 USING (a)
WHERE c like '1131%'
AND d IS NOT NULL
AND e IS NOT NULL;

Unfortunately, I have not been able to construct a suitable test case,
so I had to take the query from our productive system and had to replace
all table and field names because our table layout is considered
confidential. The tables and views are explained at the end of this mail.

This query is really slow, so I had a look at the query plan:


QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=14960.17..14960.18 rows=2 width=16) (actual
time=532691.185..532692.714 rows=1625 loops=1)
-> Sort (cost=14960.17..14960.17 rows=2 width=16) (actual
time=532691.183..532691.632 rows=1625 loops=1)
Sort Key: table1.a
-> Nested Loop (cost=0.00..14960.16 rows=2 width=16) (actual
time=145.972..532673.631 rows=1625 loops=1)
Join Filter: ((table1.a)::bpchar = (table3.a)::bpchar)
-> Nested Loop (cost=0.00..1673.61 rows=2 width=16)
(actual time=0.094..31.692 rows=1627 loops=1)
-> Seq Scan on table2 (cost=0.00..186.64 rows=2
width=4) (actual time=0.052..2.259 rows=42 loops=1)
Filter: (c ~~ '1131%'::text)
-> Index Scan using "table1_b_index" on table1
(cost=0.00..743.39 rows=8 width=20) (actual time=0.047..0.625 rows=39
loops=42)
Index Cond: (table1.b = table2.b)
Filter: ((d IS NOT NULL) AND (e IS NOT NULL))
-> Index Scan using "table3_f" on table3
(cost=0.00..3737.05 rows=232498 width=16) (actual time=0.092..243.643
rows=225893 loops=1627)
Index Cond: (f = 'foo'::bpchar)
Total runtime: 532693.200 ms
(14 rows)

If I replace "where c like '1131%'" with "where substring(c from 1 for
4) = '1131'" (which to me seems to be exactly equivalent) I get a
completely different query plan which is much faster:


QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=30626.05..30626.11 rows=13 width=16) (actual
time=378.237..379.773 rows=1625 loops=1)
-> Sort (cost=30626.05..30626.08 rows=13 width=16) (actual
time=378.236..378.685 rows=1625 loops=1)
Sort Key: table1.a
-> Hash Join (cost=9507.43..30625.80 rows=13 width=16)
(actual time=22.189..368.361 rows=1625 loops=1)
Hash Cond: ((table3.a)::bpchar = (table1.a)::bpchar)
-> Seq Scan on table3 (cost=0.00..20246.38 rows=232498
width=16) (actual time=0.012..253.335 rows=225893 loops=1)
Filter: (f = 'foo'::bpchar)
-> Hash (cost=9507.22..9507.22 rows=17 width=16)
(actual time=20.921..20.921 rows=1627 loops=1)
-> Nested Loop (cost=0.00..9507.22 rows=17
width=16) (actual time=0.121..19.837 rows=1627 loops=1)
-> Seq Scan on table2 (cost=0.00..200.89
rows=14 width=4) (actual time=0.084..3.419 rows=42 loops=1)
Filter: ("substring"((c)::text, 1, 4) =
'1131'::text)
-> Index Scan using "table1_b_index" on
table1 (cost=0.00..664.64 rows=8 width=20) (actual time=0.024..0.364
rows=39 loops=42)
Index Cond: (table1.b = table2.b)
Filter: ((d IS NOT NULL) AND (e IS NOT
NULL))
Total runtime: 380.259 ms
(15 rows)

My question is: Why do I have to optimize my query (use "substring"
instead of "like") instead of having the database do this for me? Or is
there a difference between both queries which I cannot see?
What can I do to get better results when using "like"? Do I have to add
indices?
And last question: I do not really understand the first query plan. The
actual time for the outer nested loop is 532673.631 ms. As far as I have
understood the docs this includes the child nodes. But I cannot find the
time-consuming child node. I only see two child nodes: The inner nested
loop (which took 31.692 ms) and the index scan (which took 243.643 ms).
Or do I have to multiply the 243.643 ms with 1627 (number of loops)? But
even then I get 396407.161 ms, which is still far away from the
532673.631 ms in the parent node.

Thanks for any help!

Regards,
Christian

Appendix:

The tables and views look like this (I have left out the fields and
constraints that don't participate in the queries):

Table "table1"
Column | Type | Modifiers
--------+------------------+-------------------------------
a | character(12) | not null
b | integer | not null
d | double precision |
e | double precision |
... | ... | ...
Indexes:
"table1_pkey" PRIMARY KEY, btree (a)
"table1_b_index" btree (b)

Table "table2"
Column | Type | Modifiers
--------+------------------+-------------------------------
b | integer | not null
c | character(10) | not null
... | ... | ...
Indexes:
"table2_pkey" PRIMARY KEY, btree (b)

View "view1"
Column | Type | Modifiers
--------+------------------+-------------------------------
a | character(12) |
g | double precision | not null
... | ... | ...
View definition:
SELECT table3.a, table3.g
FROM ONLY table3
WHERE f = 'foo'::bpchar;

Table "table3"
Column | Type | Modifiers
--------+------------------+-------------------------------
id | integer | not null
a | character(12) | not null
f | character(20) | not null
g | double precision |
... | ... | ...
Indexes:
"table3_pkey" PRIMARY KEY, btree (id)
"table3_f" btree (f)

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Erwin Moller 2007-09-07 11:50:44 Re: Tutorial EXPLAIN for idiots?
Previous Message M. van Egmond 2007-09-07 11:40:57 Re: Type cast text to int4