Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)

From: Nino Arsov <nino(dot)arsov(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)
Date: 2015-02-03 11:35:27
Message-ID: CAMGNKZ+5ipK+o5==gbpLrdgcTEtGVx68H1NFzJzQNVwer3XG+w@mail.gmail.com
Lists: pgsql-admin

Hello,

I've taken part in some interesting research involving estimating the
execution cost of a query by properly updating the statistics stored in
*pg_class* and *pg_statistic*. The idea is to perform range partitioning of
the tables in a schema and then update the statistics in the two system
catalogs mentioned above without transferring any data into the new schema.
This keeps algorithm execution times fast, since we use some global
optimization methods.

Using EXPLAIN, we have successfully obtained execution plans and costs from
the planner that are very similar to those produced when the newly created
partitions actually contain data. This method works just fine as long as
the query to be estimated joins no more than 2 tables.
When we try to estimate the cost of a query that joins 3 or more tables, we
get wrong plans with huge costs (on the order of millions of cost units).
When data is actually loaded into all of the partitions, the cost does not
exceed a few thousand cost units.
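
To illustrate the kind of query where the estimates diverge, a 3-table
SSB-style join might look like the following (table, column, and predicate
names here are assumptions for illustration, not our exact benchmark
queries):

```sql
-- Hypothetical sketch: inspect the planner's estimate for a 3-table join
-- when only the synthetic statistics (and no real data) are present.
EXPLAIN
SELECT c.c_nation, sum(lo.lo_revenue)
FROM lineorder lo
JOIN customer c ON lo.lo_custkey = c.c_custkey
JOIN supplier s ON lo.lo_suppkey = s.s_suppkey
WHERE s.s_region = 'ASIA'
GROUP BY c.c_nation;
```

With real data loaded, the top-level cost stays in the thousands; with only
the synthetic statistics, it jumps into the millions.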

The strategy for determining the partitions that the query should be
executed against uses a simple interval tree to find overlapping intervals
of the range values. For each partitioned table, a new master table is
created and its statistics are set to "zero" values, meaning that the
master (parent) table contains no data, while the statistics of the child
tables that inherit from it are updated properly.
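
As a rough sketch of that setup (table names and range values are
hypothetical), the master/child arrangement looks like:

```sql
-- Empty master (parent) table; its statistics are kept at "zero".
CREATE TABLE lineorder_master (LIKE lineorder INCLUDING DEFAULTS);

-- Child partition for one date range, inheriting from the master.
CREATE TABLE lineorder_1992 (
    CHECK (lo_orderdate BETWEEN 19920101 AND 19921231)
) INHERITS (lineorder_master);

-- Mark the master as containing no data.
UPDATE pg_class
   SET reltuples = 0, relpages = 0
 WHERE oid = 'lineorder_master'::regclass;
```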

We use the Star Schema Benchmark (a modification of TPC-H) and assume
uniformity of data.

*The question: Is the following list of updated statistics enough to fool
the planner into generating accurate execution plans, as if there were data
in those partitions?*

I've been searching around for a few weeks but I'm getting nowhere. Here's
what we update:

- *pg_statistic*
  - *stawidth*
  - *staop*
  - *stanumbersN*
  - *stakindN*
  - *stavaluesN*
- *pg_class*
  - *relfilenode* (we create an empty file and fool the file system about
    its size, so the planner actually sees that there is data physically
    stored on disk, although the relation file is empty)
  - *reltuples*
  - *relpages*
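
Concretely, the catalog updates look roughly like the following (the table
name, row counts, and statistics values are hypothetical; *stavaluesN* is
an anyarray column that needs special handling, so it is omitted here):

```sql
-- pg_class: pretend the empty child partition holds data.
UPDATE pg_class
   SET reltuples = 600000, relpages = 9000
 WHERE oid = 'lineorder_1992'::regclass;

-- pg_statistic: per-column statistics for the child's first column.
UPDATE pg_statistic
   SET stawidth    = 4,
       stakind1    = 2,    -- histogram slot
       staop1      = 97,   -- int4 "<" operator
       stanumbers1 = NULL  -- histogram slots carry no stanumbers
 WHERE starelid = 'lineorder_1992'::regclass
   AND staattnum = 1
   AND stainherit = false;
```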

This is probably both a tough and a demanding question, but I guess this is
the right place to ask.

Best regards,
Nino Arsov

Responses

Tom Lane 2015-02-03 15:35:24 Re: Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)