Re: Predicting query runtime

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vinicius Segalin <vinisegalin(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Predicting query runtime
Date: 2016-09-12 20:01:44
Message-ID: CAMkU=1w2pL1EtnrqWFBVQq643xnjLDbOcRtQtaarNAy_57Utpg@mail.gmail.com
Lists: pgsql-general

On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin <vinisegalin(at)gmail(dot)com>
wrote:

> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provides, such as operators and their costs. The
> thing is that I haven't found any work using PostgreSQL, so I'm struggling
> to adapt it.
> My question is whether anyone is aware of a work that uses machine
> learning and PostgreSQL to predict query runtime, or maybe some other
> method to perform this.
>

I don't know about machine learning, but if there were some way to get the
planner to report its predicted cost as a breakdown of how many multiples of
each *_cost factor it used (rather than only the grand total, which is all it
gives you now), then it would be fairly easy to combine that with wall times
from log_duration and do a simple linear regression.
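As a sketch of what that regression could look like (the per-factor unit
counts below are entirely hypothetical -- the planner does not expose such a
breakdown today), here is an ordinary least-squares fit in plain Python that
recovers effective per-unit costs from (feature vector, wall time) pairs:

```python
# Hypothetical feature vectors: per query, how many units of seq_page_cost,
# random_page_cost, and cpu_tuple_cost the planner's estimate contained.
# These numbers are made up for illustration.
features = [
    (1000.0,   10.0, 50000.0),
    ( 200.0,  500.0, 20000.0),
    (5000.0,    0.0, 80000.0),
    (  50.0, 2000.0, 10000.0),
    ( 800.0,  300.0, 40000.0),
]
# Observed wall times (ms), synthesized here from assumed true per-unit
# costs 0.01, 0.04, and 0.0001 so the fit has a known right answer; in
# practice these would come from log_duration.
times = [x[0] * 0.01 + x[1] * 0.04 + x[2] * 0.0001 for x in features]

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y,
    solved by Gaussian elimination with partial pivoting. Adequate for a
    three-parameter toy model like this one."""
    n = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n)]
    for col in range(n):
        # Pivot on the largest remaining entry in this column.
        piv = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        # Eliminate this column from the rows below.
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution.
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(xtx[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (xty[r] - s) / xtx[r][r]
    return beta

coeffs = fit_ols(features, times)
```

With noise-free synthetic data the fit recovers the assumed per-unit costs
exactly; with real log_duration data the interesting output would be the
residuals and the uncertainty on each coefficient.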

I suspect the result would be that seq_page_cost and random_page_cost would
have huge uncertainties on them. And since pretty much every query has
non-zero predicted values for at least one of those, the huge uncertainties
would then pollute all the rest of the fitted values as well. Perhaps that
is where the machine learning would come in?

Another issue is that the predicted costs are only meant to choose between
different plans, not to predict overall wall time. Some parts of the planner
have only one way to do something, and so don't bother to compute a cost for
it, as there is no choice to be made. This would leave glaring holes in the
estimates (particularly for updates).

But to get that data out would require quite a bit of tedious altering of
the planner code, and then you would have to find people willing to run
that altered code on real world databases with a high level of logging to
gather the data. (I suspect that gathering data from only toy databases
would not be very useful).

Cheers,

Jeff
