From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: On Distributions In 7.2.1 |
Date: | 2002-05-02 14:11:50 |
Message-ID: | 7233.1020348710@sss.pgh.pa.us |
Lists: | pgsql-general |
Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> However Tom's observation is still valid (in spite of my math) - all the
> frequencies are overestimated, rather than the expected "some bigger,
> some smaller" sort of thing.
No, that makes sense. The values that get into the most-common-values
list are only going to be ones that are significantly more common (in
the sample) than the estimated average frequency. So if the thing makes
a good estimate of the average frequency, you'll only see upside
outliers in the MCV list. The relevant logic is in analyze.c:
/*
 * Decide how many values are worth storing as most-common values.
 * If we are able to generate a complete MCV list (all the values
 * in the sample will fit, and we think these are all the ones in
 * the table), then do so. Otherwise, store only those values
 * that are significantly more common than the (estimated)
 * average. We set the threshold rather arbitrarily at 25% more
 * than average, with at least 2 instances in the sample. Also,
 * we won't suppress values that have a frequency of at least 1/K
 * where K is the intended number of histogram bins; such values
 * might otherwise cause us to emit duplicate histogram bin
 * boundaries.
 */
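The rule in that comment can be read roughly as follows. This is only a minimal sketch written from the comment's wording, not the actual analyze.c code: the function name choose_num_mcv, its parameters, and the "complete" flag are all hypothetical, and it assumes the per-value sample counts are already sorted in descending order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration only; names and signature are assumptions,
 * not taken from analyze.c.
 *
 * counts[]      per-value occurrence counts in the sample, sorted descending
 * n_tracked     number of entries in counts[]
 * sample_rows   number of rows in the sample
 * est_ndistinct estimated number of distinct values in the table (> 0)
 * max_mcv       maximum number of MCV slots available
 * num_bins      intended number of histogram bins (K in the comment)
 * complete      nonzero if every sampled value was tracked and we believe
 *               these are all the values in the table
 *
 * Returns how many leading entries of counts[] to store as MCVs.
 */
size_t choose_num_mcv(const int *counts, size_t n_tracked,
                      int sample_rows, double est_ndistinct,
                      size_t max_mcv, int num_bins, int complete)
{
    assert(sample_rows > 0 && est_ndistinct > 0.0 && num_bins > 0);

    /* Complete list: everything seen fits, so keep it all. */
    if (complete && n_tracked <= max_mcv)
        return n_tracked;

    /* Expected number of sample occurrences of a "typical" value. */
    double avg_count = (double) sample_rows / est_ndistinct;

    /* Threshold: 25% above average, but at least 2 instances. */
    double min_count = avg_count * 1.25;
    if (min_count < 2.0)
        min_count = 2.0;

    /* Never suppress a value whose frequency is at least 1/K. */
    double bin_count = (double) sample_rows / (double) num_bins;
    if (min_count > bin_count)
        min_count = bin_count;

    /* Counts are sorted, so stop at the first one below the threshold. */
    size_t limit = n_tracked < max_mcv ? n_tracked : max_mcv;
    size_t n = 0;
    while (n < limit && (double) counts[n] >= min_count)
        n++;
    return n;
}
```

For instance, with a 1000-row sample and an estimated 100 distinct values, the average count is 10 and the threshold is 12.5, so sample counts of 40, 20 and 13 would be kept while 12 and 5 would be dropped. Every value that survives has a sample frequency above the estimated average, which is why the list shows only upside outliers.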
regards, tom lane