From: Mark Kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: pgsql-general(at)postgresql(dot)org
Subject: On Distributions In 7.2.1
Date: 2002-05-02 04:17:47
Message-ID: 1020313069.2325.19.camel@spikey.slithery.org
Lists: pgsql-general
Dear list,
I thought I would follow up my last posting on this topic by looking at 7.2.1 -
as there is a new estimator for distinct values (see src/backend/commands/analyze.c
~line 1010 onwards).
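From a quick look at that code, the new estimator appears to be the Haas-Stokes one - that is just my reading of it, so treat the details as an assumption rather than gospel. In rough plain-text form it is

    n_distinct = n*d / (n - f1 + f1*n/N)

where n is the number of rows sampled, N the total number of rows in the table, d the number of distinct values seen in the sample, and f1 the number of values seen exactly once. Intuitively, lots of once-seen values (a large f1) are evidence of many more values not seen at all, which pushes the estimate up.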
The specimen table is the same as the one used in the 7.2 post... to recap briefly:
dbbench3=> \d fact0
Table "fact0"
Column | Type | Modifiers
--------+---------+-----------
d0key | integer |
d1key | integer |
d2key | integer |
val | integer |
filler | text |
Unique keys: fact0_pk
dbbench3=> select count(*) from fact0;
count
---------
3000000
(1 row)
dbbench3=> select count(distinct d0key) from fact0;
count
-------
3000
(1 row)
The data is generated with each value occurring the same number of times - so each of the 3000 distinct values appears 3000000/3000 = 1000 times, and the frequency of any particular value is 1000/3000000 = 1/3000, i.e. about 0.000333.
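For anyone who wants to reproduce the setup, here is a sketch of a generator. It is not the script actually used (and generate_series only turns up in releases well after 7.2), so treat it as illustrative; the d1key, d2key, val and filler expressions are placeholders:

insert into fact0 (d0key, d1key, d2key, val, filler)
select i % 3000,      -- cyclic: each of the 3000 values appears exactly 1000 times
       i % 300,       -- placeholder
       i % 30,        -- placeholder
       i,             -- placeholder
       'xxxxxxxxxx'   -- filler contents are unimportant here
from generate_series(0, 2999999) as g(i);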
The d0key values are uniformly distributed through the table, so it should be a "good" test of the statistical sampling routines used by ANALYZE:
dbbench3=> alter table fact0 alter d0key set statistics 10;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs from pg_stats
where attname='d0key';
fact0 | d0key | 3128 |
{0.00266667,0.00233333,0.002,0.00166667,0.00166667,0.00166667,0.00166667,
0.00166667,0.00166667,0.00166667}
(1 row)
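A bit of arithmetic on these numbers (mine, not part of the original session, and it assumes ANALYZE samples 300 rows per unit of statistics target, which is what the analyze.c comments suggest): a target of 10 means a sample of 3000 rows, and the frequencies above are exactly integer counts over that sample. With a true frequency of 1/3000, each value is expected only about once in such a sample, so the ten "most common" values are simply the ones that happened to be drawn 5 to 8 times:

-- my arithmetic, not from the original session:
select 8/3000.0 as top_mcv_freq,     -- ~ 0.00266667, the first entry above
       5/3000.0 as bottom_mcv_freq,  -- ~ 0.00166667, the last entry
       3000/3000 as expected_count_per_value;  -- = 1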
dbbench3=> alter table fact0 alter d0key set statistics 100;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs
from pg_stats where attname='d0key';
fact0 | d0key | 3000 |
{0.0007,0.0007,0.0007,0.0007,0.000666667,0.000666667,0.000666667,0.000666667,
0.000666667,0.000666667,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
0.000633333,0.000633333,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,
0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333}
(1 row)
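The same decoding works here (again my arithmetic, on the same assumption of 300 sampled rows per unit of target): a target of 100 gives a 30000-row sample, and every frequency above is an integer count over it, running from 21/30000 at the top down to 16/30000 at the bottom. Each value is expected about 10 times in the sample, so the hundred "most common" entries are just the values that came up well above their expectation:

-- my arithmetic, not from the original session:
select 21/30000.0 as top_mcv_freq,    -- ~ 0.0007, the first entry above
       16/30000.0 as bottom_mcv_freq, -- ~ 0.000533333, the last entry
       30000/3000 as expected_count_per_value;  -- = 10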
So the number of distinct values is now estimated extremely accurately, even with a statistics target of 10 (in 7.2 it was overestimated by a factor of 10).
There is slightly odd behaviour in that the estimated frequencies decrease as the statistics target increases (the same thing happened in 7.2 - is the same code at work here?). At a target of about 30 the estimates all fall between 0.0009 and 0.001 in this case, but note that every setting overestimates the true frequency of 1/3000 (about 0.000333) - the overestimate just shrinks as the sample grows.
I am wondering if this is caused by my example not having any "real" most common values (they are all exactly as common as each other) - the "most common" entries are then simply the values that happened to turn up most often in the sample, so their frequencies are inevitably biased high.
I am going to fiddle with my data generation script, skew the distribution and
see what effect that has.
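(For what it is worth, one simple way to skew it - a sketch, not necessarily what my script will end up doing - is to replace the cyclic i % 3000 in the generator sketch above with

    (2999 * random() * random())::int

since the product of two uniform random numbers piles up near zero, making low d0key values far more common than high ones.)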
best wishes
Mark