Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Dave Held <dave(dot)held(at)arraysg(dot)com>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-28 17:44:37
Message-ID: 42712105.6030709@kolumbus.fi
Lists: pgsql-hackers pgsql-performance


First I will comment on my original idea.
Second I will give another, improved suggestion (an idea).
I hope that they will be useful for you.

(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )

I haven't looked at the PostgreSQL code, so I don't know what is possible
now and what is not. I do know, even without looking at the code, that a
full table scan followed by incremental statistics updates would be a very
big change.

I meant the following idea:
- Compare two equal-sized samples, then redo the same thing with samples of
double the size (so yes, it does a lot of redundant work). Check how well
the two samples agree with each other, to try to guess the distribution.

So I tried to give you an idea, not a full answer to the whole problem.
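
In rough Python it would look something like the sketch below. This is only
an illustration of the idea: estimate_ndistinct(), the sample sizes, the 20%
agreement threshold, and the assumption that the column fits in a list are
all things I made up here, not anything that exists in the code.

import random

def estimate_ndistinct(sample):
    # Naive per-sample estimate: count the distinct values in the sample.
    # Any better estimator could be plugged in here instead.
    return len(set(sample))

def stable_ndistinct(column_values, start_size=1000, max_size=100000):
    # Compare two independent, equal-sized samples.  If their estimates
    # disagree badly, double the sample size and try again.
    size = start_size
    while size <= max_size and size < len(column_values):
        a = random.sample(column_values, size)
        b = random.sample(column_values, size)
        est_a, est_b = estimate_ndistinct(a), estimate_ndistinct(b)
        if abs(est_a - est_b) <= 0.2 * max(est_a, est_b):  # 20% is arbitrary
            return (est_a + est_b) // 2
        size *= 2   # the samples disagree: the column defies small samples
    # Give up on sampling and look at every value.
    return estimate_ndistinct(column_values)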

I did read some parts of the attached PDFs. They convinced me that the
heuristics for the hard cases would, in many cases, actually have to read
almost the whole table.

I did cover the "too small sample" problem by stating that the user should
be able to give a minimum sample size; this way you would avoid it. My
purpose was not to achieve estimates that are at most 5% wrong, but to
reduce the 2000% wrong estimates that are sometimes seen now.
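
I mean something as simple as the sketch below. The names are hypothetical,
and the 300-rows-per-statistics-target figure is only roughly what the
current row sampling aims at, as far as I know.

def choose_sample_rows(statistics_target, user_min_sample_rows=0):
    # Let the user force a minimum sample size, so the "too small sample"
    # case cannot happen even on huge tables.
    default_rows = 300 * statistics_target
    return max(default_rows, user_min_sample_rows)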

Conclusions:
- No heuristics or similar technique applied to small samples will give
excellent results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, such as primary keys, unique indexes, and special
column types, do give easy ways to make estimates. For example, whether a
boolean column has zero, one, or two distinct values does not matter so
much, does it? Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question
for large tables at this time.

The percentage idea of taking 10% samples seems good.

So here is another suggestion:
1. Do a partial scan over a given percentage of the table, starting at an
arbitrary position. If the user's data is not homogeneous, this hurts
accuracy, but this way it is faster. During that scan, try to figure out
all the columns that have at most 100 distinct values.

Of course, with this you can't reach 100% accuracy, but if a full table
scan is out of the question for now, it is better if the estimate is, for
example, at most ten times wrong.

You could also improve accuracy: instead of doing one contiguous 10%
partial table scan, you could do 20 pieces of 0.5 percent each (see the
sketch below). This would improve accuracy a bit but keep the speed almost
the same as the single partial table scan.
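
Here is a small sketch of how the 20 pieces could be placed. The block-range
arithmetic is only illustrative; real code would of course work with the
table's actual block count.

import random

def pick_sample_ranges(total_blocks, pieces=20, fraction_per_piece=0.005):
    # Split a 10% sample budget into `pieces` contiguous, non-overlapping
    # block ranges (20 x 0.5% = 10%), placed at random positions.
    piece_len = max(1, int(total_blocks * fraction_per_piece))
    slot_len = total_blocks // pieces     # one random piece per slot,
    ranges = []                           # so the pieces cannot overlap
    for i in range(pieces):
        slot_start = i * slot_len
        start = slot_start + random.randint(0, max(0, slot_len - piece_len))
        ranges.append((start, start + piece_len))   # [start, end) in blocks
    return ranges

# Example: a table of 1,000,000 blocks -> 20 ranges of 5,000 blocks each.
print(pick_sample_ranges(1000000))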

Here are questions for the statisticians about the distinct-values
calculation:

If we want at most 1000% tolerance, how big a percentage of one column of
the table must be processed?

If we want at most 500% tolerance, how big a percentage of one column of
the table must be processed?

If we want at most 250% tolerance, how big a percentage of one column of
the table must be processed?

You may assume that there are at most 100 distinct values in the table, if
that helps the calculations.

If we try to get as much as possible out of one discontinuous partial table
scan (a 0.1-10% sample), here is the information we can gather:

1. We could gather a histogram of at most 100 distinct values for every
column.
2. We could measure the variance and average, and the number of rows, for
these 100 distinct values.
3. We could count the number of rows that didn't match any of these 100
distinct values, i.e. the rows left out of the histogram.
4. We could get a minimum and a maximum value for each column.

=> We could get exact information about the sample, for many columns, with
one 0.1-10% pass.
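
To make points 1-4 concrete, a per-column accumulator for such a single
pass might look roughly like the sketch below. This is only my own
illustration; the cap of 100 and all the names are invented here.

class ColumnSampleStats:
    # Single-pass accumulator for one column of the sampled rows: a
    # histogram capped at 100 distinct values (1.), running average and
    # variance (2.), a count of rows that fell outside the capped
    # histogram (3.), and the minimum/maximum (4.).

    CAP = 100

    def __init__(self):
        self.histogram = {}      # value -> row count, at most CAP entries
        self.overflow_rows = 0   # rows whose value no longer fit
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0            # for Welford's running variance
        self.min_val = None
        self.max_val = None

    def add(self, value):
        # 1. and 3.: capped histogram plus overflow counter
        if value in self.histogram:
            self.histogram[value] += 1
        elif len(self.histogram) < self.CAP:
            self.histogram[value] = 1
        else:
            self.overflow_rows += 1

        # 4.: minimum and maximum
        if self.min_val is None or value < self.min_val:
            self.min_val = value
        if self.max_val is None or value > self.max_val:
            self.max_val = value

        # 2.: running average and variance (numeric columns only)
        if isinstance(value, (int, float)):
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (value - self.mean)

    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0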

What can you statisticians derive from these values?
My idea combines programming with statistics:
+ Performance: scan, for example, 100 blocks of 100 MB each, because disk
I/O is much faster this way.
+ Enables a larger table percentage, which I hope helps with the statistics
formula. For more robust statistics, take those blocks at random,
non-overlapping places, to reduce the effect of hitting statistically bad
parts of the table.
+ Fewer table scan passes: scan all columns with limited hashing in the
first pass.
+ All easy columns are handled there with one pass.
+- Harder columns each need a pass of their own, but after the first pass
we already have some preliminary knowledge of them from the sample (the
minimum and maximum values and the histogram of the 100 distinct values);
a rough sketch of this two-phase idea follows below.
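
Put together, the first pass plus the extra passes for the hard columns
could look roughly like this. Again, only a sketch: ColumnSampleStats is
the accumulator sketched above, and scan_sample_rows / exact_ndistinct_pass
are hypothetical helpers for reading the sampled blocks and for the
per-column second pass.

def analyze_table(columns, scan_sample_rows, exact_ndistinct_pass):
    # First pass: limited hashing over the sampled blocks, all columns at
    # once.  Columns whose capped histogram overflowed are "hard" and each
    # get a dedicated pass of their own.
    stats = dict((col, ColumnSampleStats()) for col in columns)

    for row in scan_sample_rows():        # yields dicts: column -> value
        for col in columns:
            stats[col].add(row[col])

    results = {}
    for col in columns:
        s = stats[col]
        if s.overflow_rows == 0:
            # Easy column: the sample saw at most 100 distinct values, so
            # the histogram itself gives n_distinct for the sample.
            results[col] = len(s.histogram)
        else:
            # Hard column: needs its own pass, but we already know its
            # min/max and the top of its histogram from the first pass.
            results[col] = exact_ndistinct_pass(col, s)
    return results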

Marko Ristola

Greg Stark wrote:

>"Dave Held" <dave(dot)held(at)arraysg(dot)com> writes:
>
>
>
>>>Actually, it's more to characterize how large of a sample
>>>we need. For example, if we sample 0.005 of disk pages, and
>>>get an estimate, and then sample another 0.005 of disk pages
>>>and get an estimate which is not even close to the first
>>>estimate, then we have an idea that this is a table which
>>>defies analysis based on small samples.
>>>
>>>
>>I buy that.
>>
>>
>
>Better yet is to use the entire sample you've gathered of .01 and then perform
>analysis on that sample to see what the confidence interval is. Which is
>effectively the same as what you're proposing except looking at every possible
>partition.
>
>Unfortunately the reality according to the papers that were sent earlier is
>that you will always find the results disappointing. Until your sample is
>nearly the entire table your estimates for n_distinct will be extremely
>unreliable.
>
>
>
