Re: Predicting query runtime

From: Oleg Ivanov <o(dot)ivanov(at)postgrespro(dot)ru>
To: Vinicius Segalin <vinisegalin(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Predicting query runtime
Date: 2016-09-14 10:23:14
Message-ID: 57D92512.1080202@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Vinicius,

I recommend you to read this
(http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf)
paper. Authors make a nice classification of different query performance
prediction methods and propose their own solution for this problem.

You can also read (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be
warned about possible pitfalls in PostgreSQL query optimizer. In my
opinion, the most unpleasant one is that you often cannot rely on
cardinality estimations made by PostgreSQL for path nodes. Typically,
the more complicated query is, the less reliable cardinality estimations
become. The good news is that cost model allows to predict query
execution time precisely enough with good cardinality estimations.

In paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf)
there is no machine learning. Nevertheless, you may find it interesting.
It contains good description of PostgreSQL cost model and a method for
automatic costs calibration (similar to proposed by Jeff in this thread).
The issue with the calibrating is follows: the multipliers for each
*_cost factor are not provided or even directly computed in PostgreSQL
for the majority of path nodes. The typical way of computations is not,
for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost,
but total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 *
(random_page_cost / 2). Mathematically these formulas are equivalent,
but practically you will spend more time and write more code to extract
the multipliers in the second case.
In the above paper authors decided to calibrate costs using only those
nodes, for which the computations are not very complicated and,
therefore, the multipliers can be extracted relatively easy. Anyway,
cost models are available in src/backend/optimizer/path/costsize.c, and
you have to get inside it somehow to obtain extra information.

As for me, the paper
(http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf)
is interesting mostly by their introduction of graph editing distance as
a distance on the space of paths. It is interesting because some machine
learning methods do not require feature representations of objects, but
only a valid distance function on each pair of them.

The paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about
predicting query execution time for concurrent workloads and also
contains machine learning.

I hope listed papers will be useful for your master's thesis.

The post related to (https://pgconf.ru/en/2016/89977) is available here
(http://tigvarts.livejournal.com/691.html). Please note, that this post
was published in February 2016, so the information in this post is
partially outdated. Some main principles were changed during my work,
some issues for further research are closed now, while some other issues
appeared. I believe I will have a paper on my current results completed
in the early October.

------
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com
<http://www.postgrespro.com/>
The Russian Postgres Company

On 09/12/2016 05:03 PM, Vinicius Segalin wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people
> are using machine learning to do so. For the feature vector, they use
> what the DBMS's query planner provide, such as operators and their
> cost. The thing is that I haven't found any work using PostgreSQL, so
> I'm struggling to adapt it.
> My question is if anyone is aware of a work that uses machine learning
> and PostgreSQL to predict query runtime, or maybe some other method to
> perform this.
>
> Thank you.
>
> Best regards,
>
> Vinicius Segalin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Durumdara 2016-09-14 13:52:09 Re: Restricted access on DataBases
Previous Message Alex Sviridov 2016-09-14 10:06:02 pgAdmin3 backup over ssh tunnel