From: | "Tomas Vondra" <tv(at)fuzzy(dot)cz> |
---|---|
To: | "Katharina Büchse" <katharina(dot)buechse(at)uni-jena(dot)de> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: two dimensional statistics in Postgres |
Date: | 2014-11-06 10:56:36 |
Message-ID: | 07b431d480db44208daa93cbaeb7a577.squirrel@2.emaily.eu |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a):
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in the
> field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column
> correlations and creates statistical data about correlated columns for
> the optimizer. Therefore I need to store two dimensional statistics
> (especially two dimensional histograms) in PostgreSQL.
Cool!
> I had a look at the description of "WIP: multivariate statistics / proof
> of concept", which looks really promising, I guess these statistics are
> based on scans of the data? For my system I need both -- statistical
Yes, it's based on a sample of the data.
> data based on table scans (actually, samples are enough) and those based
> on query feedback. Query feedback (tuple counts and, speaking a little
> inaccurately, the where-part of the query itself) needs to be extracted
> and there needs to be a decision for the optimizer, when to take
> multivariate statistics and when to use the one dimensional ones. Oracle
> in this case just disables one dimensional histograms if there is
> already a multidimensional histogram, but this is not always useful,
> especially in the case of a feedback based histogram (which might not
> cover the whole data space). I want to use both kinds of histograms
What do you mean by not covering the whole data space? I assume that when
building feedback-based histogram, parts of the data will be filtered out
because of WHERE clauses etc. Is that what you mean? I don't see how this
could happen for regular histograms, though.
> because correlations might occur only in parts of the data. In this case
> a histogram based on a sample of the whole table might not get the point
> and wouldn't help for the part of the data the user seems to be
> interested in.
Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.
Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
"sufficiently similar" ones).
Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?
> There are special data structures for storing multidimensional
> histograms based on feedback and I already tried to implement one of
> these in C. In the case of two dimensions they are of course not "for
> free" (one dimensional would be much cheaper), but based on the
> principle of maximum entropy they deliver really good results. I decided
> for only two dimensions because in this case we have the best proportion
> of cost and benefit when searching for correlation (here I'm relying on
I think hardcoding the two-dimensions limit is wrong. I understand higher
number of dimensions means more expensive operation, but if the user can
influence it, I believe it's OK.
Also, is there any particular reason why not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.
> tests that were made in DB2 within a project called CORDS which detects
> correlations even between different tables).
Is this somehow related to LEO? I'm not familiar with the details, but
from the description it might be related.
regards
Tomas
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2014-11-06 10:57:30 | Re: two dimensional statistics in Postgres |
Previous Message | Andreas Karlsson | 2014-11-06 10:51:45 | Re: B-Tree index builds, CLUSTER, and sortsupport |