From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2014-12-11 16:53:51
Message-ID: 5489CC1F.70401@vmware.com
Lists: pgsql-hackers
On 10/13/2014 01:00 AM, Tomas Vondra wrote:
> Hi,
>
> attached is a WIP patch implementing multivariate statistics.
Great! Really glad to see you working on this.
> + * FIXME This sample sizing is mostly OK when computing stats for
> + * individual columns, but when computing multivariate stats
> + * (histograms, mcv, ...) it's rather insufficient. For a small
> + * number of dimensions it works, but for complex stats it'd be
> + * nice to use a sample proportional to the table size (say,
> + * 0.5% - 1%) instead of a fixed size.
I don't think a fraction of the table is appropriate. As long as the
sample is random, the accuracy of an estimate depends on the sample
size, not on the size of the population; the population size only
enters through a finite-population correction, which is negligible
whenever the sample is a small fraction of the table. For example, if
you sample 1,000 rows from a table with 100,000 rows, or 1,000 rows
from a table with 100,000,000 rows, the accuracy is pretty much the
same. That doesn't change when you go from a single variable to
multiple variables.
You do need a bigger sample with multiple variables, however. My gut
feeling is that if you sample N rows for a single variable, with two
variables you need to sample N^2 rows to get the same accuracy. But it's
not proportional to the table size. (I have no proof for that, but I'm
sure there is literature on this.)
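
To illustrate the first point, here's a throwaway sketch (nothing to do
with the patch; the population sizes, the 30% "true" fraction and the
helper are all made up): it draws the same 1000-row sample from two
populations three orders of magnitude apart, and the estimates come out
equally close to the true fraction.

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Estimate the fraction of "matching" rows by sampling. The first
     * true_frac * population rows are the matching ones, so the exact
     * answer is true_frac; the question is how close a fixed-size
     * random sample gets, for different population sizes.
     */
    static double
    estimate_fraction(long population, int sample_size, double true_frac)
    {
        int     hits = 0;

        for (int i = 0; i < sample_size; i++)
        {
            long    row = random() % population;

            if (row < (long) (true_frac * population))
                hits++;
        }

        return (double) hits / sample_size;
    }

    int
    main(void)
    {
        srandom(42);

        /* same 1000-row sample, populations 1000x apart */
        printf("100,000 rows:     %.4f\n",
               estimate_fraction(100000L, 1000, 0.3));
        printf("100,000,000 rows: %.4f\n",
               estimate_fraction(100000000L, 1000, 0.3));

        return 0;
    }

Run it a few times with different seeds and the spread of the two
estimates around 0.3 is essentially the same.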
> + * Multivariate histograms
> + *
> + * Histograms are a collection of buckets, represented by n-dimensional
> + * rectangles. Each rectangle is delimited by an array of lower and
> + * upper boundaries, so that for the i-th attribute
> + *
> + * min[i] <= value[i] <= max[i]
> + *
> + * Each bucket tracks frequency (fraction of tuples it contains),
> + * information about the inequalities, number of distinct values in
> + * each dimension (which is used when building the histogram) etc.
> + *
> + * The boundaries may be either inclusive or exclusive, or the whole
> + * dimension may be NULL.
> + *
> + * The buckets may overlap (assuming the build algorithm keeps the
> + * frequencies additive) or may not cover the whole space (i.e. allow
> + * gaps). This entirely depends on the algorithm used to build the
> + * histogram.
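
Just to check that I'm reading that description right, this is roughly
the bucket structure I picture (a sketch only; the field names are
mine, not taken from your patch, and the types are the usual ones from
postgres.h):

    typedef struct MVBucket
    {
        int         ndimensions;    /* number of attributes covered */

        /* per-dimension boundaries: min[i] <= value[i] <= max[i] */
        Datum      *min;
        Datum      *max;
        bool       *min_inclusive;  /* is the lower boundary inclusive? */
        bool       *max_inclusive;  /* is the upper boundary inclusive? */
        bool       *nullsonly;      /* the whole dimension is NULL */

        float4      frequency;      /* fraction of tuples in this bucket */
        uint32     *ndistinct;      /* distinct values per dimension,
                                     * used while building the histogram */
    } MVBucket;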
That sounds pretty exotic. These buckets are quite different from the
single-dimension buckets we currently have.
The paper you reference in the partition_bucket() function (M.
Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating
Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference
1988: 28-36) actually doesn't mention overlapping buckets at all. I
haven't read the code in detail, but if it implements the algorithm
from that paper, there will be no overlap.
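
For comparison, the kind of non-overlapping equi-depth construction I'd
expect goes roughly like this greedy variant (every helper here is
hypothetical, just to show the shape; I'm not claiming this is what
partition_bucket() actually does):

    static void
    build_equidepth_histogram(MVBucket **buckets, int *nbuckets,
                              int max_buckets)
    {
        /*
         * Start with one bucket covering the whole sample, then
         * repeatedly split the most populated bucket at the median of
         * one of its dimensions. The two halves share a boundary value
         * but do not overlap, and each keeps roughly half the tuples,
         * so the frequencies stay additive by construction.
         */
        while (*nbuckets < max_buckets)
        {
            MVBucket   *bucket = find_largest_bucket(buckets, *nbuckets);
            int         dim = choose_split_dimension(bucket);
            Datum       median = dimension_median(bucket, dim);

            buckets[(*nbuckets)++] = split_bucket(bucket, dim, median);
        }
    }

With splits like that there is simply no way to end up with overlapping
buckets, which matches the paper.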
- Heikki