Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 17:44:39
Message-ID: CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:

> Here's an analysis of Jeff Janes' simple example of a table where our
> n_distinct estimate is way off.
>
> On Dec11, 2013, at 00:03 , Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > create table baz as select floor(random()*10000000), md5(random()::text)
> > from generate_series(1,100000000);
> > create table baz2 as select * from baz order by floor;
> > create table baz3 as select * from baz order by md5(floor::text);
> >
> > baz unclustered, baz2 is clustered with perfect correlation, baz3 is
> > clustered but without correlation.
> >
> > After analyzing all of them:
> >
> > select tablename, n_distinct,correlation from pg_stats where tablename
> > like 'baz%' and attname='floor' ;
> >  tablename | n_distinct  | correlation
> > -----------+-------------+-------------
> >  baz       | 8.56006e+06 |  0.00497713
> >  baz2      |      333774 |           1
> >  baz3      |      361048 |  -0.0118147
>
> I think I understand what's going on here. I worked with a reduced test case
> of 1e7 rows containing random values between 0 and 5e5, and instrumented
> analyze to print the raw ndistinct and nmultiple values of the sample
> population (i.e. the number of distinct values in the sample, and the number
> of distinct values which appeared more than once). I've also considered only
> baz and baz2, and thus removed the then-unnecessary md5 column. To account
> for the reduced table sizes, I adjusted default_statistics_target to 10
> instead of 100. The resulting estimates are then
>
>  tablename | n_distinct (est) | n_distinct (act)
> -----------+------------------+------------------
>  baz       |           391685 |           500000
>  baz2      |            36001 |           500000
>
> ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
> those.
>
> The sample of baz contains 2989 distinct values, 11 of which appear more than
> once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
> appear more than once.
>
> The very different results stem from the Duj1 estimator we use. It estimates
> n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of
> samples, N the number of rows, d the number of distinct samples, and f1 the
> number of distinct samples occurring exactly once. If all samples are unique
> (i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
> quickly - sampling baz2 produces 117 non-unique values out of 2878, roughly
> 4% - and the estimate is already less than 1/10 of what it would be if f1
> were 0.
>
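
To make that cliff concrete, here is the formula in a few lines of Python with
the sample numbers you report plugged in (taking f1 as d minus the number of
values seen more than once); it reproduces both estimates:

def duj1(n, N, d, f1):
    # Duj1 estimator: n*d / (n - f1 + f1*n/N)
    return n * d / (n - f1 + f1 * n / float(N))

n, N = 3000, 10000048

# baz: 2989 distinct values in the sample, 11 of them seen more than once
print(duj1(n, N, d=2989, f1=2989 - 11))     # ~391685

# baz2: 2878 distinct values in the sample, 117 of them seen more than once
print(duj1(n, N, d=2878, f1=2878 - 117))    # ~36001
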
> Which leaves us with the question of why sampling baz2 produces more duplicate
> values than sampling baz does. Now, if we used block sampling, that behaviour
> would be unsurprising - baz2 is sorted, so each block very likely contains
> each value more than once, since the row count exceeds the number of possible
> values by more than an order of magnitude. In fact, with block sampling, we'd
> probably see f1 values close to 0, and thus our estimate of n_distinct would
> be roughly equal to the number of distinct values in the *sample* population,
> i.e. about 300 or so.
>
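
That effect is easy to reproduce with a toy simulation. This is only a rough
stand-in for baz2 - it assumes a perfectly sorted table with exactly 20 rows
per value and 100 rows per block, neither of which is exactly true of the real
table - but it shows row-level sampling giving f1 close to d while whole-block
sampling gives f1 close to 0:

import random
from collections import Counter

N, SAMPLE, ROWS_PER_VALUE, ROWS_PER_BLOCK = 10000000, 3000, 20, 100

def d_and_f1(sample_rows):
    # In the sorted table, the value of a row is its position // ROWS_PER_VALUE
    counts = Counter(r // ROWS_PER_VALUE for r in sample_rows)
    return len(counts), sum(1 for c in counts.values() if c == 1)

# Row-level sampling: 3000 row positions chosen uniformly from the whole table
row_sample = random.sample(range(N), SAMPLE)

# Block-level sampling: 30 whole blocks of 100 consecutive rows each
blocks = random.sample(range(N // ROWS_PER_BLOCK), SAMPLE // ROWS_PER_BLOCK)
block_sample = [b * ROWS_PER_BLOCK + i for b in blocks for i in range(ROWS_PER_BLOCK)]

print("row sample:   d, f1 =", d_and_f1(row_sample))    # f1 nearly equal to d
print("block sample: d, f1 =", d_and_f1(block_sample))  # f1 at or near 0
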
> So why does Vitter's algorithm fail here? Given that we see inflated numbers
> of duplicate values in our sample, yet still far fewer than block-based
> sampling would yield, my guess is that it's the initial reservoir filling
> that bites us here. After that initial filling step, the reservoir contains a
> lot of consecutive rows, i.e. a block-based sample taken from just a few
> blocks. If the replacement phase that follows somehow uses a slightly smaller
> replacement probability than it should, more of these rows will survive
> replacement, resulting in exactly the kind of inflated numbers of duplicate
> values we're seeing. I've yet to validate this by looking at the reservoir
> before and after the replacement stage, though.
>

I think the problem is more subtle than that. It is easier to visualize if
you think of every block as having the same number of rows, with that number
being fairly large. If you pick 30,000 rows at random from 1,000,000 blocks,
the number of rows chosen from any given block should closely follow a
Poisson distribution with a mean of 0.03, which means about 29,113 blocks
should have exactly one row chosen from them, while 441 would have two or
more rows chosen from them.

But if you instead select 30,000 rows from 30,000 blocks, which is what we
ask Vitter's algorithm to do, you get approximately a Poisson distribution
with a mean of 1.0. Then about 11,036 blocks have exactly one row chosen from
them, and 7,927 blocks have two or more rows sampled from them. Another
11,036 blocks get zero rows selected from them due to Vitter, in addition to
the 970,000 that didn't even get submitted to Vitter in the first place.
That is why you see too many duplicates for clustered data, as too many
blocks are sampled multiple times.
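
Those counts are just the Poisson probabilities for zero, one, and two-or-more
rows, scaled by the number of blocks in play; a few lines of Python reproduce
them:

from math import exp, factorial

def poisson(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

# 30,000 rows spread over 1,000,000 blocks: lambda = 0.03
lam, nblocks = 0.03, 1000000
print(nblocks * poisson(1, lam))                          # ~29113 blocks, one row
print(nblocks * (1 - poisson(0, lam) - poisson(1, lam)))  # ~441 blocks, two or more

# 30,000 rows drawn from only 30,000 blocks: lambda = 1.0
lam, nblocks = 1.0, 30000
print(nblocks * poisson(1, lam))                          # ~11036 blocks, one row
print(nblocks * (1 - poisson(0, lam) - poisson(1, lam)))  # ~7927 blocks, two or more
print(nblocks * poisson(0, lam))                          # ~11036 blocks, zero rows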

The Poisson argument doesn't apply cleanly when blocks have a variable number
of rows, but the general principle still applies. Too few blocks have exactly
one row sampled from them, and too many blocks have either zero rows or more
than one row.

It would be relatively easy to fix this if we trusted the number of visible
rows in each block to be fairly constant. But without that assumption, I
don't see a way to fix the sample selection process without reading the
entire table.
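
For what it's worth, here is roughly what I mean by "relatively easy" under
that (untrustworthy) constant-rows-per-block assumption: pick global row
positions uniformly and map them back to (block, offset) pairs, so the
per-block counts follow the lambda = 0.03 distribution above rather than the
lambda = 1.0 one. The names are made up; this is just a sketch, not the
actual acquire_sample_rows code:

import random

def sample_block_offsets(nblocks, rows_per_block, sample_size):
    # Choose sample_size distinct row positions uniformly over the whole table,
    # then translate each into a (block number, offset-within-block) pair.
    total_rows = nblocks * rows_per_block
    positions = random.sample(range(total_rows), sample_size)
    return sorted((pos // rows_per_block, pos % rows_per_block) for pos in positions)

# e.g. 30,000 sample rows spread over 1,000,000 blocks of 100 rows each
pairs = sample_block_offsets(1000000, 100, 30000)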

> So at least for the purpose of estimating n_distinct, our current sampling
> method seems to exhibit the worst rather than the best properties of block-
> and row-based sampling. What conclusions to draw from that, I'm not sure yet
> - other than that if we move to block-based sampling, we'll certainly have
> to change the way we estimate n_distinct.
>

Perhaps we should consider changing that even if we don't change to
block-based sampling--but what would the other candidates be? I guess the
same paper that presented Duj1 would be a good place to start, but are there
others? It sounds like this was discussed several years ago, but I didn't
find it in the archives with a casual search.

Cheers,

Jeff
