| From: | Josh Berkus <josh(at)agliodbs(dot)com> | 
|---|---|
| To: | Greg Stark <gsstark(at)mit(dot)edu> | 
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: default_statistics_target WAS: max_wal_senders must die | 
| Date: | 2010-10-21 01:41:36 | 
| Message-ID: | 4CBF9A50.8040604@agliodbs.com | 
| Lists: | pgsql-hackers | 
> I don't see why the MCVs would need a particularly large sample size
> to calculate accurately. Have you done any tests on the accuracy of
> the MCV list?
Yes, although I don't have them at my fingertips.  In sum, though, you
can't take a 10,000-row sample from a 1-billion-row table and expect to
get a remotely accurate MCV list.
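To put rough numbers on that, here's a quick toy simulation (mine, just
for illustration -- the Zipf-ish distribution, the 1M-row scale-down, and
the top-100 MCV list are all assumptions, not anything we measured).  It
builds an MCV list from a 10,000-row uniform sample and compares it
against the true top 100 of the table, which is enough to see how noisy
the tail of the list and its frequency estimates get:

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical skewed "table": 1M rows over 50,000 distinct values with
# Zipf-ish weights.  (Scaled down from 1b rows so it runs in seconds;
# the shape of the distribution is an assumption for illustration.)
N_ROWS = 1_000_000
N_DISTINCT = 50_000
weights = [1.0 / (rank + 1) for rank in range(N_DISTINCT)]
table = random.choices(range(N_DISTINCT), weights=weights, k=N_ROWS)

true_freqs = Counter(table)
true_mcv = [v for v, _ in true_freqs.most_common(100)]

# ANALYZE-style uniform sample of 10,000 rows, then take its top 100
sample = random.sample(table, 10_000)
sample_freqs = Counter(sample)
sample_mcv = [v for v, _ in sample_freqs.most_common(100)]

overlap = len(set(true_mcv) & set(sample_mcv))
print(f"sample MCVs that are really in the top 100: {overlap}/100")

# How noisy is the frequency estimate for the 100th-most-common value?
v = true_mcv[-1]
true_frac = true_freqs[v] / N_ROWS
est_frac = sample_freqs.get(v, 0) / len(sample)
print(f"value {v}: true fraction {true_frac:.5f}, sampled {est_frac:.5f}")
```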
A while back I did a fair bit of reading in the academic literature on
ndistinct estimation for large tables.  The consensus of many papers was
that it takes a sample of at least 3% of the table (or 5% for block-based
sampling) to estimate ndistinct within a factor of 3 with 95% confidence.
I can't imagine that the MCV list is any easier than that.
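For context, the estimator in play is (IIRC) the Haas & Stokes formula
that analyze.c uses, n*d / (n - f1 + f1*n/N), where d is the number of
distinct values in the sample and f1 the number seen exactly once.  The
sketch below is mine, with a made-up column that is half repeated values
and half near-unique ids, just to show how much the estimate moves as the
sample fraction grows toward the 3-5% range those papers talk about:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical 1M-row column: half the rows repeat one of 1,000 common
# values, half are effectively unique ids.  Both the shape and the sizes
# are assumptions chosen to make ndistinct hard to estimate.
N_ROWS = 1_000_000
table = [random.randrange(1_000) if i % 2 == 0 else 1_000 + i
         for i in range(N_ROWS)]
true_ndistinct = len(set(table))

def estimate_ndistinct(sample, total_rows):
    """Haas/Stokes-style estimate: n*d / (n - f1 + f1*n/N), with f1 =
    number of values seen exactly once in the sample."""
    counts = Counter(sample)
    n, d = len(sample), len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

for fraction in (0.001, 0.01, 0.03, 0.05):
    sample = random.sample(table, int(N_ROWS * fraction))
    est = estimate_ndistinct(sample, N_ROWS)
    print(f"sample {fraction:>5.1%}: ndistinct estimate {est:>9.0f}"
          f"  (true {true_ndistinct:,})")
```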
> And mostly
> what it tells me is that we need a robust statistical method and the
> data structures it requires for estimating the frequency of a single
> value.
Agreed.
>  Binding the length of the MCV list to the size of the histogram is
> arbitrary but so would any other value and I haven't seen anyone
> propose any rationale for any particular value.
Histogram size != sample size.  The two are tied together in our code,
but that's a bug, not a feature.
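As a data point on how tied together they are right now: ANALYZE samples
300 * statistics_target rows, and the MCV list and histogram are each
capped at statistics_target entries, so the only way to ask for a bigger
sample is to also ask for bigger stats arrays.  Quick arithmetic (mine,
just to put numbers against the 1b-row example above):

```python
# Back-of-the-envelope numbers; the 300x multiplier is the one analyze.c
# uses to size the sample from the statistics target, the rest is just
# arithmetic for the hypothetical 1b-row table discussed above.
default_statistics_target = 100                  # current default
sample_rows = 300 * default_statistics_target    # rows ANALYZE samples
table_rows = 1_000_000_000

print(f"sample rows:          {sample_rows:,}")
print(f"sample fraction:      {sample_rows / table_rows:.4%}")
print(f"a 3% sample would be  {int(0.03 * table_rows):,} rows")
```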
-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com