Re: Experimental evaluation of PostgreSQL's query optimizer

From: Viktor Leis <leis(at)in(dot)tum(dot)de>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental evaluation of PostgreSQL's query optimizer
Date: 2015-12-21 12:53:27
Message-ID: 5677F647.9030907@in.tum.de
Lists: pgsql-hackers

On 2015-12-21 at 09:22, Albe Laurenz wrote:
> Viktor Leis wrote:
>> We have recently performed an experimental evaluation of PostgreSQL's
>> query optimizer. For example, we measured the contributions of
>> cardinality estimation and the cost model on the overall query
>> performance. You can download the resulting paper here:
>> http://www.vldb.org/pvldb/vol9/p204-leis.pdf
>>
>> Some findings:
>> 1. Perhaps unsurprisingly, we found that cardinality
>> estimation is the biggest problem in query optimization.
>> 2. The quality of Postgres' cardinality estimates is not generally worse
>> than that of the major commercial systems.
>> 3. It seems to me that one obvious way to avoid many bad situations
>> would be to disable nested loop joins when the inner relation is NOT
>> an index scan.
>>
>> I hope this will be of interest to some of you.
>
> I have read the paper with great interest, and I have some comments.
>
> - The paper mentions that the "Join Order Benchmark" has high cross-table
> correlation, and this correlation is responsible for bad cardinality
> estimates that cause bad plans with all RDBMS.
> Wouldn't it be interesting to do the same experiment with different
> real-world data sets to see if that is indeed typical and not an
> idiosyncrasy of that specific benchmark?
The IMDB data set certainly has lots of correlations in comparison
with synthetic benchmarks like TPC-H.

I do not claim that IMDB is representative (whatever that means). But
it is certainly a data set that people put into a database and run
join queries on. It would indeed be very interesting to do similar
experiments with other real-world data sets. We had to come up with
our own queries because one seldom finds a public relational data set
that comes with non-trivial OLAP queries.

> - The paper suggests that sampling the base tables is preferable to
> using statistics because it gives better estimates, but I think that that
> is only a win with long running, complicated, data warehouse style queries.
> For the typical OLTP query it would incur intolerable planning times.
> Any ideas on that?
I agree that sampling is not suitable for most OLTP queries. (One
hack would be to first run the optimizer without sampling and, only
if the estimated cost turns out to be high, reoptimize with
sampling.)
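
At the SQL level, the kind of estimate that sampling would provide
can be sketched with TABLESAMPLE (available from PostgreSQL 9.5).
This is only an illustration (table r1 and the predicate are made up;
a real implementation would of course live inside the optimizer):

-- scan roughly 1% of r1's pages and extrapolate the match count
select count(*) * 100 as estimated_matches
from r1 tablesample system (1)
where a = 42;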

In data warehouse settings, I've seen suggestions to increase
default_statistics_target to a large value, which in my experience
results in extremely large planning times. Sampling could be a more
precise alternative.
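
For reference, the workaround I mean is nothing more than the
standard knob (shown here at the session level; 10000 is the upper
limit):

set default_statistics_target = 10000;  -- default is 100
analyze;  -- recollect statistics with much larger per-column samples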

> - From my experience in tuning SQL queries I can confirm your one finding,
> namely that bad cardinality estimates are the prime source for bad
> plan choices.
> Perhaps it would be valuable to start thinking about statistics for
> inter-table correlation. What about something as "simple" as a factor
> per (joinable) attribute pair that approximates the total row count
> of a join on these attributes, divided by the planner's estimate?
I think your suggestion amounts to caching the cardinalities of all
two-way joins. One major issue is that for a query like

select * from r1, r2 where r1.x = r2.y and r1.a = ? and r2.b = ?;

whether the two filter predicates are correlated, anti-correlated, or
independent depends on the specific values of r1.a and r2.b. And
obviously one cannot store correction factors for every combination
of values of a and b.
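
As a hypothetical illustration using the query template above: both
of the following instantiations would receive the same cached
correction factor for the attribute pair (x, y), even though their
true result sizes may differ by orders of magnitude:

-- the chosen constants may select positively correlated rows ...
select * from r1, r2 where r1.x = r2.y and r1.a = 1 and r2.b = 1;
-- ... or anti-correlated rows, yet the join template is identical
select * from r1, r2 where r1.x = r2.y and r1.a = 2 and r2.b = 9;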

> - I also can corroborate your finding that nested loop joins are often
> harmful, particularly when the inner loop is a sequential scan.
> One of the first things I do when investigating bad performance of a query
> whose plan has a nested loop join is to set enable_nestloop to "off"
> and see if that makes a difference, and it often does.
> Maybe it would be a win to bias the planner against nested loop joins.
> This is dreaming, but it might be nice to have some number as to how
> reliable a certain estimate is, which is high if the estimate is, say,
> derived from a single filter on a base table and sinks as more conditions
> are involved or numbers are pulled out of thin air.

I think it would be a good start to distinguish between nested loop
joins with and without an index. In my opinion, the latter should
simply NEVER be chosen.
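
For completeness, the manual check you describe amounts to the
following session-local experiment (the query itself is a made-up
example):

set enable_nestloop = off;  -- forbid nested loop joins for this session
explain analyze
select * from r1, r2 where r1.x = r2.y and r1.a = 1;
reset enable_nestloop;  -- restore the default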

--
Viktor Leis
