From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 04:03:31
Message-ID: AANLkTi=1WULMrNxYUQite54cWuDx_-SowiDUNh+593nh@mail.gmail.com
Lists: pgsql-hackers
On Wed, Oct 20, 2010 at 6:41 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature. The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X. I can't imagine that
> MCV is easier than this.
Interestingly, I also read up on this but came to a different and even
more pessimistic conclusion. Basically, unless you're willing to read
about 50% or more of the table you can't make useful estimates at all,
and even then the estimates are pretty unreliable. That makes a lot of
sense, since a handful of entries can easily change ndistinct
completely.
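
To make that concrete, here's a quick toy simulation (just something
illustrative, not what ANALYZE does -- IIRC ANALYZE uses a
Haas/Stokes-style estimator, which does better on this particular shape
but has its own bad cases). It uses the crudest possible estimator,
linearly scaling up the distinct count seen in a random sample, on a
column where every value appears 100 times:

import random
random.seed(1)

N = 1_000_000
dup = 100                         # each distinct value appears 100 times
column = [i // dup for i in range(N)]
true_nd = N // dup                # 10,000 distinct values

for pct in (0.01, 0.05, 0.10, 0.50):
    k = int(N * pct)
    d = len(set(random.sample(column, k)))
    print(f"sample {pct:.0%}: {d} distinct in sample, "
          f"naive scale-up -> {d * N // k:,}, true ndistinct = {true_nd:,}")

Even at a 50% sample the naive scale-up is still off by about 2x, and at
small sample fractions it's off by an order of magnitude or more,
because the sample simply can't see how heavily the values are
duplicated.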
> histogram size != sample size. It is in our code, but that's a bug and
> not a feature.
For the histogram there's a solid statistical reason why the two are related.
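If I remember the details right, that reason is the bound from
Chaudhuri, Motwani and Narasayya's "Random sampling for histogram
construction: how much is enough?", which analyze.c cites: the required
sample grows linearly with the number of histogram buckets and only
logarithmically with the table size. A rough sketch of the arithmetic
(my numbers, treat as a sanity check rather than gospel):

import math

# r = 4 * k * ln(2n / gamma) / f^2, per Chaudhuri/Motwani/Narasayya
# (as I read it): k = histogram buckets, n = table size, f = max relative
# error in bucket size (0.5 suggested), gamma = failure probability (0.01).
def min_sample_rows(k, n=1_000_000, f=0.5, gamma=0.01):
    return 4 * k * math.log(2 * n / gamma) / f ** 2

for k in (10, 100, 1000):
    r = min_sample_rows(k)
    print(f"{k:>4} buckets -> ~{r:,.0f} sample rows (~{r / k:.0f} per bucket)")

# The ~306 rows per bucket is where ANALYZE's 300 * statistics_target
# sample size comes from; note it doesn't grow with the table size.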
For ndistinct, I agree you would need to sample a proportion of the
table, and from what I read you really want that proportion to be 100%.
For the MCV list I'm not entirely clear yet what the right answer is.
It's possible you're right, but I don't see a good algorithm for
calculating the MCVs accurately for large sample sizes using a
reasonable amount of resources.
--
greg