From: | pgsql(at)mohawksoft(dot)com |
---|---|
To: | "Bruno Wolff III" <bruno(at)wolff(dot)to> |
Cc: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Query optimizer 8.0.1 (and 8.0) |
Date: | 2005-02-07 18:28:04 |
Message-ID: | 16805.24.91.171.78.1107800884.squirrel@mail.mohawksoft.com |
Lists: | pgsql-hackers |
> On Mon, Feb 07, 2005 at 11:27:59 -0500,
> pgsql(at)mohawksoft(dot)com wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of a study, especially when the diversity of the subject is unknown. It is
>> known that reducing a sample size increases the probability of error in
>> any poll or study. The required sample size depends on the variance of the
>> whole. It is mathematically unsound to ASSUME any sample size is valid
>> without understanding the standard deviation of the set.
>
> For large populations, the accuracy of estimates of statistics based on
> random samples from that population is not very sensitive to population
> size and depends primarily on the sample size. So you would not expect to
> need larger sample sizes on larger data sets, for data sets over some
> minimum size.

That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there were zero deviation in
the data, then a sample of one would work fine.

If the standard deviation is high, then you need more samples. If you have
a high standard deviation and a large data set, you need more samples than
you would need for a smaller data set.
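
As a rough illustration of how spread drives the required sample size, here is a sketch using the textbook formula for estimating a mean within a given margin, with the usual finite population correction. This is not what analyze.c does; the function name, the 95% z-value, and the example numbers are assumptions chosen only for illustration.

```python
import math

def required_sample_size(sigma, margin, z=1.96, population=None):
    """Rows needed to estimate a mean to within +/- margin.

    sigma      -- assumed standard deviation of the column
    margin     -- acceptable error in the estimated mean
    z          -- z-score for the confidence level (1.96 is roughly 95%)
    population -- total number of rows, or None for "effectively infinite"
    """
    # Infinite-population size: n0 = (z * sigma / margin)^2
    n0 = (z * sigma / margin) ** 2
    if population is not None:
        # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# Hypothetical columns with low and high spread, same margin of error.
low_spread = required_sample_size(sigma=10, margin=1)
high_spread = required_sample_size(sigma=100, margin=1)
print(low_spread, high_spread)

# Same spread, tables of 10,000 and 4.6 million rows.
print(required_sample_size(10, 1, population=10_000))
print(required_sample_size(10, 1, population=4_600_000))
```

With these made-up inputs, a tenfold increase in standard deviation raises the required sample roughly a hundredfold, and for a fixed standard deviation the 4.6 million row table needs somewhat more rows than the 10,000 row table.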

In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with 4.6
million rows, that is only about 0.002%.
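
The sampling fractions can be checked with a few lines. This is only the arithmetic for the figures quoted in this message (100 sampled rows against each table size); the helper name is made up.

```python
def sample_fraction_percent(sample_rows, table_rows):
    """Percentage of the table covered by a sample of sample_rows rows."""
    return 100.0 * sample_rows / table_rows

# The two table sizes discussed above, each with a 100-row sample.
for rows in (10_000, 4_600_000):
    pct = sample_fraction_percent(100, rows)
    print(f"{rows:>9} rows: {pct:.5f}% sampled")
```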

Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sampling strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.
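
To put a number on "miss it altogether", here is a minimal sketch that computes the chance that a sample contains none of the rows belonging to a rare event. It assumes a uniform random sample of rows drawn without replacement, which is a simplification and not necessarily how analyze.c picks rows; the function name and the 1,000-row event are hypothetical.

```python
def prob_sample_misses(table_rows, event_rows, sample_rows):
    """Probability that a uniform sample (without replacement) of
    sample_rows rows contains none of the event_rows rare rows."""
    non_event = table_rows - event_rows
    if sample_rows > non_event:
        # More draws than non-event rows: at least one hit is certain.
        return 0.0
    p = 1.0
    for i in range(sample_rows):
        # Chance that draw i is a non-event row, given that all
        # earlier draws were non-event rows too.
        p *= (non_event - i) / (table_rows - i)
    return p

# Hypothetical: 1,000 rows of a rare event in a 4.6 million row table.
for n in (100, 3_000, 30_000):
    print(n, round(prob_sample_misses(4_600_000, 1_000, n), 4))
```

Under these assumptions, a 100-row sample misses the event entirely about 98% of the time, and the miss probability only becomes small once the sample is tens of thousands of rows.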