Re: ANALYZE sampling is too good

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 18:56:42
Message-ID: 52AA06EA.1030508@agliodbs.com

On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Well, we still need to look at each individual block to determine
grouping correlation. Let's take a worst-case example: imagine a table
that has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present. The answer to this is to make our
block samples fairly widely spaced and compare them.

In this simplified example, if the table had 1000 blocks, we would take
blocks 1, 101, 201, 301, 401, etc. Then we would compare the number and
content of values found on each block with the number and content found
on each other block. For example, if we see that block 101 is entirely
the category "cats", and block 701 is entirely the category "shopping"
and block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and the number of unknown values we haven't seen is also high.
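
Just to make that step concrete, here's a rough sketch in Python (not
real ANALYZE code; the read_block_categories() helper and the numbers
are made up for illustration) of picking evenly spaced blocks and
tallying the categories seen in each one:

from collections import Counter

def spaced_block_numbers(total_blocks, sample_blocks):
    # Spread the sample evenly across the table (zero-based blocks).
    stride = max(total_blocks // sample_blocks, 1)
    return list(range(0, total_blocks, stride))[:sample_blocks]

def per_block_distributions(read_block_categories, block_numbers):
    # Count category occurrences separately for each sampled block.
    return {blk: Counter(read_block_categories(blk)) for blk in block_numbers}

# A 1000-block table sampled at 10 evenly spaced blocks:
print(spaced_block_numbers(1000, 10))   # [0, 100, 200, ..., 900]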

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping is
moderate and it's just the case that most of the dataset is "cats",
which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.
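
One way to turn those comparisons into a number (the scoring rule below
is just an assumption on my part, not a worked-out estimator) would be
to average the pairwise overlap of the per-block category distributions
from the sketch above: near-identical blocks score close to 1, disjoint
blocks score close to 0.

from itertools import combinations

def overlap(c1, c2):
    # Fraction of frequency mass shared by two per-block Counters (0..1).
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(min(c1[k] / n1, c2[k] / n2) for k in set(c1) | set(c2))

def grouping_score(dists):
    # Average pairwise overlap across all sampled blocks.
    pairs = list(combinations(dists.values(), 2))
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)

# score near 1.0 -> blocks look alike   -> low grouping, few unseen values
# score near 0.0 -> blocks are disjoint -> high grouping, many unseen values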

As someone else pointed out, full-block sampling (the proposal) vs.
random-row sampling (our current style) doesn't have a very significant
effect on estimates of histograms and nullfrac, as long as the sampled
blocks are widely spaced. Well, nullfrac is affected in the extreme
example of a totally
ordered table where the nulls are all in one block, but I'll point out
that we can (and do) also miss that using our current algo.
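
To illustrate that corner case (the table layout here is invented just
for the demo): with all the NULLs packed into a single block, both
whole-block and random-row sampling report a nullfrac of zero unless
the sample happens to land on that block.

import random

BLOCKS, ROWS_PER_BLOCK = 1000, 100      # true nullfrac = 1/1000
table = [[None] * ROWS_PER_BLOCK if b == 0 else [1] * ROWS_PER_BLOCK
         for b in range(BLOCKS)]

def nullfrac_block_sample(n_blocks):
    # Sample whole blocks, then count NULLs over every row in them.
    rows = [r for b in random.sample(range(BLOCKS), n_blocks) for r in table[b]]
    return sum(r is None for r in rows) / len(rows)

def nullfrac_row_sample(n_rows):
    # Sample individual rows uniformly across the whole table.
    rows = [table[random.randrange(BLOCKS)][random.randrange(ROWS_PER_BLOCK)]
            for _ in range(n_rows)]
    return sum(r is None for r in rows) / len(rows)

print(nullfrac_block_sample(100), nullfrac_row_sample(10000))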

Estimated grouping should, however, affect MCVs. In cases where we
estimate that grouping levels are high, the expected % of observed
values should be "discounted" somehow. That is, with a totally random
distribution you have a 1:1 ratio between observed frequency of a value
and assumed frequency. However, with highly grouped values, you might
have a 2:1 ratio.
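
Sketching what that discount might look like (the mapping from grouping
score to ratio below is purely hypothetical; as I said, the real math
still needs to be worked out):

def discounted_mcv_freqs(observed_freqs, grouping_score):
    # observed_freqs: {value: fraction seen in the sample}
    # grouping_score: 0..1, from the sketch above
    # At score 1.0 (blocks look alike) keep the 1:1 ratio;
    # at score 0.0 (heavy clustering) assume a 2:1 ratio.
    ratio = 2.0 - grouping_score
    return {v: f / ratio for v, f in observed_freqs.items()}

print(discounted_mcv_freqs({"cats": 0.60, "shopping": 0.10}, 0.2))
# {'cats': 0.333..., 'shopping': 0.055...}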

Again, more math (backed by statistical analysis) is needed.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
