From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Multidimensional Histograms |
Date: | 2023-12-27 21:19:57 |
Message-ID: | 6e92450c-8136-11d4-8f6e-501c693af5c8@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Alexander,
We actually had histograms in the early patches adding multivariate
statistics [1]. We however ended up removing histograms and only kept
the simpler types, for a couple reasons.
It might be worth going through the discussion - I'm sure one of the
reasons was we needed to limit the scope of the patch, and histograms
were much more complex and possibly less useful compared to the other
statistics types.
Another reason was that the algorithm described in the two papers you
reference (1988 paper by DeWitt and the evaluation by Carlson and
Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
processes the columns one by one - first build bucket on column "a",
then splits each bucket into buckets on "b". So it's not symmetrical,
and it results in lower accuracy compared to an "ideal" histogram with
the same number of buckets (especially for the dimensions split early).
This does indeed produce an equi-depth histogram, which seems nice, but
the mesh is regular in such a way that any value of the first dimension
intersects all buckets on the second dimension. So for example with a
histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
100 buckets on "b", each representing 1/10000 of tuples. But we don't
know how the tuples are distributed in the tuple - so we end up using
0.5 of the bucket as unbiased estimate, but that can easily add-up in
the wrong dimension.
However, this is not the only way to build an equi-depth histogram,
there are ways to make it more symmetrical. Also, it's not clear
equi-depth histograms are ideal with multiple columns. There are papers
proposing various other types of histograms (using different criteria to
build buckets optimizing a different metric). The most interesting one
seems to be V-Optimal histograms - see for example papers [1], [2], [3],
[4], [5] and [6]. I'm sure there are more. The drawback of course is
that it's more expensive to build such histograms.
IIRC the patch tried to do something like V-optimal histograms, but
using a greedy algorithm and not the exhaustive stuff described in the
various papers.
[1] https://www.vldb.org/conf/1998/p275.pdf
[2]
https://cs-people.bu.edu/mathan/reading-groups/papers-classics/histograms.pdf
[3] https://dl.acm.org/doi/pdf/10.1145/304182.304200
[4] http://www.cs.columbia.edu/~gravano/Papers/2001/sigmod01b.pdf
[5] https://cs.gmu.edu/~carlotta/publications/vldb090.pdf
[6] https://core.ac.uk/download/pdf/82158702.pdf
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Joe Conway | 2023-12-27 21:26:22 | Re: Password leakage avoidance |
Previous Message | Dave Cramer | 2023-12-27 21:11:17 | Re: Password leakage avoidance |