From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Marc Cousin <cousinmarc(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: query plan not optimal
Date: 2013-12-19 20:36:19
Message-ID: 1387485379.74465.YahooMailNeo@web162902.mail.bf1.yahoo.com
Lists: pgsql-performance
Marc Cousin <cousinmarc(at)gmail(dot)com> wrote:
> Then we insert the missing paths. This is one of the plans that fail:
> insert into path (path)
> select path from batch
> where not exists
> (select 1 from path where path.path=batch.path)
> group by path;
I know you said you wanted to focus on a different query, but this
one can easily be optimized. Right now it checks for an existing
row in path once for each row in batch, when it only needs to
check once for each distinct path. One way to write it would be:
insert into path (path)
select path from (select distinct path from batch) b
where not exists
(select 1 from path p where p.path = b.path);
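If you want to compare how the two forms are planned, something
like this works (just a sketch using the table names from your
example); note that EXPLAIN ANALYZE actually runs the insert, so
wrap it in a transaction and roll back if you only want to see the
plan:
begin;
explain (analyze, buffers)
insert into path (path)
select path from (select distinct path from batch) b
where not exists
(select 1 from path p where p.path = b.path);
rollback;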
> So now we insert into the file table.
>
> insert into file (pathid,filename)
> select pathid, filename from batch join path using (path);
> What I think is the cause of the problem is that the planner
> doesn't take into account that we are going to fetch the exact
> same values all the time in the path table, so we'll have a very
> good hit ratio.
It kinda takes that into account for the index part of things via
the effective_cache_size setting. That should normally be set to
50% to 75% of machine RAM.
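For example, on a 16GB machine like the one I tested on below,
something in this range is a reasonable starting point (shown here
as a session-level setting; the same value can go in
postgresql.conf for all sessions):
show effective_cache_size;          -- often still at the conservative default
set effective_cache_size = '12GB';  -- roughly 75% of 16GB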
> Maybe the n_distinct from batch.path could be used to refine the
> caching effect on the index scan ?
Interesting idea.
> For now, we work around this by using very low values for
> seq_page_cost and random_page_cost for these 2 queries.
If you are not already doing so, you might want to try setting
cpu_tuple_cost to something in the 0.03 to 0.05 range. I have
found that the default is too low compared to other cpu cost
factors, and raising it makes plan choice less sensitive to the
exact page cost settings -- that is, you get good plans over a
wider range of page cost values. I have sometimes been unable to get a good
plan for a query without boosting this, regardless of what I do
with other settings.
Running with a development build on my 16GB development PC, I got
your fast plan with your "big data" test case by making only this
one adjustment from the postgresql.conf defaults:
set effective_cache_size = '2GB';
I also got the fast plan if I left effective_cache_size at the
default and only changed:
set cpu_tuple_cost = 0.03;
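If you do end up keeping per-query overrides as a stopgap for
those two statements, SET LOCAL keeps them from leaking into the
rest of the session, since the values revert at commit or
rollback. A sketch, with the cost numbers purely illustrative:
begin;
set local cpu_tuple_cost = 0.03;
set local seq_page_cost = 0.5;
set local random_page_cost = 1.5;
insert into file (pathid, filename)
select pathid, filename from batch join path using (path);
commit;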
I know that there have been adjustments to cost calculations for
use of large indexes in both minor and major releases. If a little
sensible tuning of cost factors to better match reality doesn't do
it for you, you might want to consider an upgrade.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company