query plan not optimal

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Marc Cousin <cousinmarc(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: query plan not optimal
Date: 2013-12-29 18:51:46
Message-ID: CAMkU=1zyT5HM6npOR0jF5LyAGYtgh9KCfLWyfYLLG8fFxqK=zA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thursday, December 19, 2013, Marc Cousin wrote:

>
>
> Yeah, I had forgotten to set it up correctly on this test environment
> (its value is correctly set in production environments). Putting it to a
> few gigabytes here gives me this cost:
>
> bacula=# explain select pathid, filename from batch join path using (path);
> QUERY PLAN
>
> ----------------------------------------------------------------------------
> Nested Loop (cost=0.56..2083904.10 rows=479020 width=26)
> -> Seq Scan on batch (cost=0.00..11727.20 rows=479020 width=85)
> -> Index Scan using idx_path on path (cost=0.56..4.32 rows=1 width=16)
> Index Cond: (path = batch.path)
> (4 lignes)
>
> It still chooses the hash join though, but by a smaller margin.
>

This is still a tangent from your original point, but if I create index on
path (path, pathid), then I can get an index only scan. This actually is
not much faster when everything is already cached, but the planner thinks
it will be about 2x faster because it assumes the vm block accesses are
free. So this might be enough to tip it over for you.

> And it still only will access a very small part of path (always the same
> 5000 records) during the query, which isn't accounted for in the cost if
> I understand correctly ?
>

I think you are correct, that it doesn't take account of ndistinct being 10
to 100 fold less than ntuples on the outer loop, which theoretically could
propagate down to the table size used in connection with
effecitve_cache_size.

It seems to me the cost of the hash join is being greatly underestimated,
which I think is more important than the nested loop being overestimated.
(And in my hands, the merge join is actually the winner both in the
planner and in reality, but I suspect this is because all of your fake
paths are lexically greater than all of the real paths)

Also, you talked earlier about cheating the planner by lowering
random_page_cost. But why is that cheating? If caching means the cost is
truly lower...

Cheers,

Jeff

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Devrim GÜNDÜZ 2013-12-29 20:08:41 Re: Error install -pgmemcache
Previous Message Michael Kolomeitsev 2013-12-29 00:52:07 Re: Pg makes nonoptimal choice between two multicolumn indexes with the same columns but in different order.