From: | "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com> |
---|---|
To: | pgsql-performance <pgsql-performance(at)postgresql(dot)org> |
Subject: | Workaround for cross column stats dependency |
Date: | 2008-01-23 00:57:07 |
Message-ID: | 1d4e0c10801221657s6b345fcdqd1c0b3f20e519316@mail.gmail.com |
Lists:      pgsql-performance
Hi -performance,
While testing 8.3, I ran into this query, which is equally slow on 8.1 and
8.3 and seems really slow for a not-so-complex query. The stats are as good
as they can be, and PostgreSQL's behaviour looks perfectly logical given
those stats, but I'm looking for a workaround to speed this query up.
So here is the original query:
cityvox_prod=# EXPLAIN ANALYZE SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, rubtylieu rtl, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC' AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu = rtl.codetylieu AND rtl.codeth = 'RES' -- the interesting part is here
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2773.18..2773.25 rows=26 width=43) (actual time=602.822..602.829 rows=13 loops=1)
   Sort Key: vq.flagintramuros, vq.liblong
   Sort Method:  quicksort  Memory: 26kB
   ->  HashAggregate  (cost=2772.31..2772.57 rows=26 width=43) (actual time=602.769..602.778 rows=13 loops=1)
         ->  Hash Join  (cost=2737.48..2769.83 rows=248 width=43) (actual time=601.999..602.580 rows=167 loops=1)
               Hash Cond: ((vq.codequar)::text = (l.codequar)::text)
               ->  Bitmap Heap Scan on vilquartier vq  (cost=1.91..27.50 rows=47 width=43) (actual time=0.032..0.067 rows=47 loops=1)
                     Recheck Cond: ((codevil)::text = 'MUC'::text)
                     ->  Bitmap Index Scan on idx_vilquartier_codevil  (cost=0.00..1.90 rows=47 width=0) (actual time=0.023..0.023 rows=47 loops=1)
                           Index Cond: ((codevil)::text = 'MUC'::text)
               ->  Hash  (cost=2646.25..2646.25 rows=7145 width=5) (actual time=601.955..601.955 rows=62526 loops=1)
                     ->  Nested Loop  (cost=0.00..2646.25 rows=*7145* width=5) (actual time=0.058..548.618 rows=*62526* loops=1)
                           ->  Nested Loop  (cost=0.00..349.71 rows=7232 width=4) (actual time=0.049..147.221 rows=66292 loops=1)
                                 ->  Nested Loop  (cost=0.00..8.59 rows=13 width=4) (actual time=0.027..0.254 rows=88 loops=1)
                                       ->  Seq Scan on rubtylieu rtl  (cost=0.00..2.74 rows=1 width=4) (actual time=0.013..0.026 rows=1 loops=1)
                                             Filter: ((codeth)::text = 'RES'::text)
                                       ->  Index Scan using ind_genrelieu2 on genrelieu gl  (cost=0.00..5.68 rows=*14* width=8) (actual time=0.013..0.119 rows=*88* loops=1)
                                             Index Cond: ((gl.codetylieu)::text = (rtl.codetylieu)::text)
                                 ->  Index Scan using idx_lieugelieu_codegelieu on lieugelieu lgl  (cost=0.00..18.33 rows=633 width=8) (actual time=0.014..0.802 rows=753 loops=88)
                                       Index Cond: ((lgl.codegelieu)::text = (gl.codegelieu)::text)
                           ->  Index Scan using pk_lieu on lieu l  (cost=0.00..0.31 rows=1 width=9) (actual time=0.003..0.004 rows=1 loops=66292)
                                 Index Cond: (l.numlieu = lgl.numlieu)
                                 Filter: (l.dfinvalidlieu IS NULL)
 Total runtime: 602.930 ms
The query looks for the parts of the city where you can find restaurants.
The problem with this query is that we have several categories (genrelieu)
for a given type (rubtylieu), and several types for a given theme (here the
theme is codeth = 'RES'). When the theme is RES (restaurants), there is only
one type (also called RES). The catch is that genrelieu has a lot of rows
for the value RES (all the kinds of food and so on) compared to the other
values. Since PostgreSQL doesn't know the value RES when it plans the scan
on genrelieu (it only sees the join on rtl.codetylieu), 14 rows is the
expected estimate given the overall distribution of values in genrelieu, so
there's nothing really wrong here. But it's really, really slow.
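For what it's worth, here is roughly how to see what the planner has to work
with (just a sketch against the same tables; pg_stats is the standard view
exposing the per-column statistics the planner uses):

-- Actual distribution: RES has many more rows than the average type.
SELECT codetylieu, count(*) AS nb
FROM genrelieu
GROUP BY codetylieu
ORDER BY nb DESC;

-- What the planner sees for that column.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'genrelieu' AND attname = 'codetylieu';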
If I remove the join on rubtylieu and inject the value it returns directly
into the query, the estimates are exact and I get a far better plan:
cityvox_prod=# EXPLAIN ANALYZE SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC' AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu = 'RES'
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7070.53..7070.59 rows=26 width=43) (actual time=8.502..8.511 rows=13 loops=1)
   Sort Key: vq.flagintramuros, vq.liblong
   Sort Method:  quicksort  Memory: 26kB
   ->  HashAggregate  (cost=7069.65..7069.91 rows=26 width=43) (actual time=8.451..8.458 rows=13 loops=1)
         ->  Hash Join  (cost=10.06..7053.22 rows=1643 width=43) (actual time=0.318..8.267 rows=167 loops=1)
               Hash Cond: ((lgl.codegelieu)::text = (gl.codegelieu)::text)
               ->  Nested Loop  (cost=2.63..7001.77 rows=7358 width=47) (actual time=0.098..7.361 rows=973 loops=1)
                     ->  Nested Loop  (cost=2.63..5527.36 rows=4319 width=47) (actual time=0.084..2.574 rows=630 loops=1)
                           ->  Index Scan using idx_vilquartier_codevil on vilquartier vq  (cost=0.00..34.06 rows=47 width=43) (actual time=0.023..0.062 rows=47 loops=1)
                                 Index Cond: ((codevil)::text = 'MUC'::text)
                           ->  Bitmap Heap Scan on lieu l  (cost=2.63..115.05 rows=146 width=9) (actual time=0.014..0.035 rows=13 loops=47)
                                 Recheck Cond: ((l.codequar)::text = (vq.codequar)::text)
                                 Filter: (l.dfinvalidlieu IS NULL)
                                 ->  Bitmap Index Scan on lieu_i_codequar  (cost=0.00..2.59 rows=146 width=0) (actual time=0.010..0.010 rows=13 loops=47)
                                       Index Cond: ((l.codequar)::text = (vq.codequar)::text)
                     ->  Index Scan using idx_lieugelieu_numlieu_principal on lieugelieu lgl  (cost=0.00..0.32 rows=2 width=8) (actual time=0.003..0.005 rows=2 loops=630)
                           Index Cond: (lgl.numlieu = l.numlieu)
               ->  Hash  (cost=6.33..6.33 rows=88 width=4) (actual time=0.153..0.153 rows=88 loops=1)
                     ->  Bitmap Heap Scan on genrelieu gl  (cost=2.23..6.33 rows=*88* width=4) (actual time=0.028..0.088 rows=*88* loops=1)
                           Recheck Cond: ((codetylieu)::text = 'RES'::text)
                           ->  Bitmap Index Scan on ind_genrelieu2  (cost=0.00..2.21 rows=88 width=0) (actual time=0.021..0.021 rows=88 loops=1)
                                 Index Cond: ((codetylieu)::text = 'RES'::text)
 Total runtime: 8.589 ms
So the question is: is there any way to get a better plan for the original
query, other than running a first query in the application to fetch the
list of types and injecting them into a second query (the one just above)?
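Just to make that workaround explicit, this is the two-step approach I have
in mind (a sketch only, with the values returned by the first query
hard-coded into the second by the application):

-- Step 1, run from the application:
SELECT codetylieu FROM rubtylieu WHERE codeth = 'RES';

-- Step 2, with the returned values injected as literals so the planner
-- can use the real selectivity of codetylieu:
SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC' AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu IN ('RES')  -- values from step 1
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;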
Thanks.
--
Guillaume