From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: WIP: multivariate statistics / proof of concept |
Date: | 2015-03-31 00:26:26 |
Message-ID: | 5519E9B2.5080105@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello,
attached is a new version of the patch series. Aside from fixing various
issues (crashes, memory leaks). The patches are rebased to current
master, and I also attach a few SQL scripts I used for testing (nothing
fancy, just stress-testing all the parts the patch touches).
The main changes in the patches (requiring plenty of changes in the
other parts) are about these:
(1) combining multiple statistics on a table
--------------------------------------------
In the previous version of the patch, it was only possible to use a
single statistics on a table - when there was a statistics "covering"
all the conditions it worked fine, but that's not always the case.
The new patch is able to combine multiple statistics by decomposing the
probability (=selectivity) into conditional probabilities. Imagine
estimating selectivity of clauses
WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)
with statistics on [a,b,c] and [b,c,d]. The selectivity may be split for
example like this:
P(a=1,b=1,c=1,d=1) = P(a=1,b=1,c=1) * P(d=1|a=1,b=1,c=1)
where P(a=1,b=1,c=1) may be estimated using statistics [a,b,c], and the
second may be simplified like this:
P(d=1|a=1,b=1,c=1) = P(d=1|b=1,c=1)
using the assumption "no multivariate stats => independent". Both these
probabilities match the existing statistics.
The idea is described a bit more in the part #5 of the patch.
(2) choosing the best combination of statistics
-----------------------------------------------
There may be more statistics on a table, and multiple possible ways to
use them to estimate the clauses (different ordering, overlapping
statistics, etc.).
The patch formulates this as an optimization task with two goals.
(a) cover as many clauses as possible
(b) reuse as many conditions (i.e. dependencies) as possible
and implements two algorithms to solve this: (a) exhaustive, walking
through all possible states (using dynamic programming), and (b) greedy,
choosing the best local solution in each step.
The time requirements for the exhaustive solution grows pretty quickly
with the number of clauses and statistics on a table (~ O(N!)). The
greedy is much faster, as it's ~O(N) and in fact much more time is spent
in actually processing the selected statistics (walking through the
histograms etc.).
I assume the exhaustive search may find a better solution in some cases
(that the greedy algorithm misses), but so far I've been unable to come
up with such example.
To make this easier to test, I've added GUC to switch between these
algorithms easily (set to 'greedy' by default)
mvstat_search = {'greedy', 'exhaustive'}
I assume this GUC will be removed eventually, after we figure out which
algorithm is the right one.
(3) estimation of more complex conditions (AND/OR clauses)
----------------------------------------------------------
I've added ability to estimate more complex clauses - combinations of
AND/OR clauses and such. It's somewhat incomplete at the moment, but
hopefully the ideas will be clear from the TODOs/FIXMEs along the way.
Let me know if you have any questions about this version of the patch,
or about the ideas it implements in general.
I also welcome real-world examples of poorly estimated queries, so that
I can test if these patches improve that particular case situation.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-shared-infrastructure-and-functional-dependencies.patch | text/x-diff | 65.9 KB |
0002-clause-reduction-using-functional-dependencies.patch | text/x-diff | 44.5 KB |
0003-multivariate-MCV-lists.patch | text/x-diff | 109.4 KB |
0004-multivariate-histograms.patch | text/x-diff | 117.4 KB |
0005-multi-statistics-estimation.patch | text/x-diff | 71.2 KB |
mcv.sql | application/sql | 25.8 KB |
histogram.sql | application/sql | 34.8 KB |
dependencies.sql | application/sql | 19.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2015-03-31 00:40:14 | Re: Exposing PG_VERSION_NUM in pg_config |
Previous Message | Mark Kirkwood | 2015-03-31 00:14:45 | Re: Streaming replication |