From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Estimating number of distinct values.
Date: 2018-10-24 14:06:57
Message-ID: 4338f834-dee9-2eb8-0577-10abe9d39e2d@postgrespro.ru

Hello hackers,

I would be pleased if somebody (first of all Robert) could comment on the
"strange" results of distinct-value estimation described below.

There is the following code in analyze.c:

            /*----------
             * Estimate the number of distinct values using the estimator
             * proposed by Haas and Stokes in IBM Research Report RJ 10025:
             *        n*d / (n - f1 + f1*n/N)
             * where f1 is the number of distinct values that occurred
             * exactly once in our sample of n rows (from a total of N),
             * and d is the total number of distinct values in the sample.
             * This is their Duj1 estimator; the other estimators they
             * recommend are considerably more complex, and are numerically
             * very unstable when n is much smaller than N.
             *
             * In this calculation, we consider only non-nulls. We used to
             * include rows with null values in the n and N counts, but that
             * leads to inaccurate answers in columns with many nulls, and
             * it's intuitively bogus anyway considering the desired result
             * is the number of distinct non-null values.
             *
             * Overwidth values are assumed to have been distinct.
             *----------
             */
            int            f1 = ndistinct - nmultiple + toowide_cnt;
            int            d = f1 + nmultiple;
            double        n = samplerows - null_cnt;
            double        N = totalrows * (1.0 - stats->stanullfrac);
            double        stadistinct;

            /* N == 0 shouldn't happen, but just in case ... */
            if (N > 0)
                stadistinct = (n * d) / ((n - f1) + f1 * n / N);
            else
                stadistinct = 0;

In my case there are no null values:

 totalrows = 48980014
 samplerows = 30000
 ndistinct = 29135
 nmultiple = 800
 null_cnt = 0
 stats->stanullfrac = 0
 toowide_cnt = 0

This formula gives:

 stadistinct = 503391.66267850093

"Naive" estimation of number of distinct values: d*N/n  gives 47567756
which is much closer to the real number of distinct values and is almost
100 times (!!!) larger than Duj1 estimator.
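
For reference, here is a minimal standalone C sketch that plugs the
counters above into both formulas (the variable names follow analyze.c;
with these inputs it prints a Duj1 estimate of the same order of
magnitude, ~5e5, as the stadistinct quoted above):

#include <stdio.h>

int
main(void)
{
    /* counters reported above */
    double  totalrows = 48980014;
    int     samplerows = 30000;
    int     ndistinct = 29135;
    int     nmultiple = 800;
    int     toowide_cnt = 0;
    int     null_cnt = 0;
    double  stanullfrac = 0.0;

    int     f1 = ndistinct - nmultiple + toowide_cnt;   /* seen exactly once */
    int     d = f1 + nmultiple;                         /* distinct in sample */
    double  n = samplerows - null_cnt;
    double  N = totalrows * (1.0 - stanullfrac);

    /* Haas & Stokes Duj1: n*d / (n - f1 + f1*n/N) */
    double  duj1 = (n * d) / ((n - f1) + f1 * n / N);

    /* "naive" linear scale-up of the sample's distinct count */
    double  naive = d * N / n;

    printf("Duj1:  %.2f\n", duj1);      /* ~5e5 with these inputs */
    printf("naive: %.2f\n", naive);     /* ~4.76e7 with these inputs */
    return 0;
}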

The real number of distinct values for this dataset is about 10 million.
For some reason, sampling random blocks with the Vitter algorithm produces
worse results than simply examining the first 30000 rows of the table: in
that case the number of distinct values in the sample is 6835, and d/n*N
is 11 million, which is very close to the real number of distinct values.
Certainly this can be an artifact of the data distribution in this
particular table.

I have read the cited article. The other estimators really do look
difficult to implement, but the Duj1 result in this case is confusing.
Maybe instead of sampling-based estimation we could use streaming-based
estimation, for example the HyperLogLog algorithm already present in
Postgres?
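
To make the suggestion concrete, here is a sketch against the in-tree HLL
API (initHyperLogLog / addHyperLogLog / estimateHyperLogLog from
src/include/lib/hyperloglog.h are the real functions; the hll_ndistinct
helper, and hashing the values with hash_any as if they were by-reference
and fixed-length, are my simplifying assumptions — real code would use the
column type's own hash support function). Unlike Duj1, such an estimator
could be fed every row of the table in one streaming pass instead of a
sample:

#include "postgres.h"

#include "access/hash.h"            /* hash_any() */
#include "lib/hyperloglog.h"        /* hyperLogLogState etc. */

/*
 * Hypothetical helper: estimate the number of distinct values among
 * 'nvalues' by-reference datums of fixed length 'typlen'.
 */
static double
hll_ndistinct(Datum *values, int nvalues, int typlen)
{
    hyperLogLogState hll;
    int         i;

    initHyperLogLog(&hll, 10);      /* 2^10 registers, ~3% expected error */

    for (i = 0; i < nvalues; i++)
    {
        uint32      h;

        h = DatumGetUInt32(hash_any((unsigned char *) DatumGetPointer(values[i]),
                                    typlen));
        addHyperLogLog(&hll, h);
    }

    return estimateHyperLogLog(&hll);
}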
