From:       Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To:         pgsql-general(at)postgresql(dot)org
Subject:    On Distributions In 7.2 (Longish)
Date:       2001-10-27 03:08:38
Message-ID: 01102716083800.01060@spikey.slithery.org
Lists:      pgsql-general
The current sources (7.2) have introduced distributional factors into the
system statistics.
I have been examining the behaviour of these additions, using the dataset
from my "warehouse comparison" as a test bed - as it has some large-ish
tables with controllable data distributions.
I think the results are quite interesting....
Tables
------
Table "dim0"
Column | Type | Modifiers
--------+--------------------------+-----------
d0key | integer |
f1 | timestamp with time zone |
f2 | character varying(20) |
f3 | character varying(20) |
Indexes: dim0_q1 (on f1 - UNIQUE)
Unique keys: dim0_pk
Rows : 3000
Table "fact0"
Column | Type | Modifiers
--------+---------+-----------
d0key | integer |
d1key | integer |
d2key | integer |
val | integer |
filler | text |
Indexes: fact0_q1 (on d0key ONLY)
Rows : 3000000
Distribution : d0key uniformly distributed
1000 occurrences for each value of d0key - i.e.
0.001 frequency for each value of d0key
3000 distinct values of d0key
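In case anyone wants to reproduce the setup, here is roughly the DDL implied by
the table descriptions above (the index and key names are taken from the output;
placing dim0_pk on d0key is an assumption, as the key column is not shown):
CREATE TABLE dim0 (
    d0key  integer CONSTRAINT dim0_pk PRIMARY KEY,   -- key column assumed
    f1     timestamp with time zone,
    f2     varchar(20),
    f3     varchar(20)
);
CREATE UNIQUE INDEX dim0_q1 ON dim0 (f1);
CREATE TABLE fact0 (
    d0key  integer,
    d1key  integer,
    d2key  integer,
    val    integer,
    filler text
);
CREATE INDEX fact0_q1 ON fact0 (d0key);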
Query
-----
The query to be examined is:
SELECT
d0.f3,
count(f.val)
FROM dim0 d0,
fact0 f
WHERE d0.d0key = f.d0key
AND d0.f1 between '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;
This will join 88 rows from the dim0 table with 88000 from the fact0 table
and group them into 3 "month" buckets.
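Those counts can be sanity-checked directly (just verification queries, nothing
the planner sees):
SELECT count(*)
FROM dim0
WHERE f1 BETWEEN '1999-12-01' AND '2000-02-29';     -- expect 88
SELECT count(*)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND d0.f1 BETWEEN '1999-12-01' AND '2000-02-29';    -- expect 88000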
Case 1 :
--------
Consider using the default distributional sampling settings (10 quantiles) -
since 10 is already the default, the ALTER TABLE is left commented out:
-- ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;
System Stats
------------
SELECT most_common_vals, most_common_freqs, n_distinct FROM pg_stats WHERE
tablename = 'fact0' AND attname = 'd0key';
most_common_vals
{"2243","2751","105","250","525","1623","2112","2331","2983","28"}
most_common_freqs
{"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
n_distinct 36511
Note that we are out by an order of magnitude here for the number of distinct
values - it should be 3000 - and the frequencies are a little overestimated -
they should be 0.001.
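For comparison, the exact figures for fact0.d0key can be computed directly
(a full scan, so slow, but unambiguous):
SELECT count(DISTINCT d0key)            AS distinct_d0key,   -- 3000
       count(*) / count(DISTINCT d0key) AS rows_per_value    -- 1000
FROM fact0;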
QUERY PLAN
Aggregate (cost=29712.88..29749.00 rows=722 width=18)
-> Group (cost=29712.88..29730.94 rows=7225 width=18)
-> Sort (cost=29712.88..29712.88 rows=7225 width=18)
-> Nested Loop (cost=0.00..29249.81 rows=7225 width=18)
-> Index Scan using dim0_q1 on dim0 d0 (cost=0.00..4.43
rows=89 width=10)
-> Index Scan using fact0_q1 on fact0 f
(cost=0.00..326.33 rows=81 width=8)
RESULTS
-------
f3 | count
----+-------
01 | 30000
02 | 28000
12 | 30000
(3 rows)
ELAPSED : 19s
Clearly the query statistics are underestimating the number of rows for the
nested loop join by about a factor of 10 (7225 estimated versus roughly 88000
actual). The estimated rows from dim0 are ok.
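A back-of-envelope way to see where the 7225 comes from (this is only
arithmetic with the figures above, not a reading of the planner source): with
n_distinct estimated at 36511 over roughly 3000000 rows, the expected rows per
d0key and hence the join size work out as
SELECT 3000000 / 36511.0         AS est_rows_per_key,        -- ~82, cf. rows=81
       89 * (3000000 / 36511.0)  AS est_join_rows,           -- ~7300, cf. rows=7225
       89 * (3000000 / 3000.0)   AS rows_if_ndistinct_right; -- ~89000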
Case 2 :
--------
Let's try 100 quantiles:
ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;
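(As an aside, the new per-column target can be confirmed in pg_attribute -
attstattarget for fact0.d0key should now read 100; this is just the obvious
catalog query, nothing 7.2-specific as far as I know.)
SELECT a.attname, a.attstattarget
FROM pg_class c, pg_attribute a
WHERE c.relname = 'fact0'
AND a.attrelid = c.oid
AND a.attname = 'd0key';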
System Stats
------------
most_common_vals {"328","1242","524","1515","2058","2168",( 94 more
values)...
most_common_freqs
{"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",....
n_distinct 3027
Now the number of distinct values is very accurate and frequencies are a
little underestimated.
QUERY PLAN
Aggregate (cost=118518.96..118958.65 rows=8794 width=18)
-> Group (cost=118518.96..118738.80 rows=87937 width=18)
-> Sort (cost=118518.96..118518.96 rows=87937 width=18)
-> Hash Join (cost=4.65..111297.50 rows=87937 width=18)
-> Seq Scan on fact0 f (cost=0.00..87693.36
rows=3000036 width=8)
-> Hash (cost=4.43..4.43 rows=89 width=10)
-> Index Scan using dim0_q1 on dim0 d0
(cost=0.00..4.43 rows=89 width=10)
ELAPSED : 60s
The query statistics are now very accurate (e.g. 89 rows from dim0 and 87937
rows joined) - note that, ironically, the better execution plan (the 19s nested
loop rather than the 60s hash join) was chosen with the poorer statistical data!
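One way to confirm that it is the plan rather than the statistics making the
difference would be to force the nested loop under the 100-quantile statistics
and time it separately - I have not included those numbers here, this is just
the obvious follow-up:
SET enable_hashjoin TO off;
EXPLAIN
SELECT d0.f3, count(f.val)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND d0.f1 BETWEEN '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;
SET enable_hashjoin TO on;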
The conclusion here seems to be that 10 quantiles are not quite enough for
accurate distributional data where (large) tables have a few thousand distinct
values. However, 100 quantiles were sufficient to get accurate statistics.
Further Notes
-------------
Experimentation showed that accurate estimates (+/- 10%) of the number of
distinct values did not begin to appear until about 75 quantiles were used.
On the other hand, reducing the number of distinct entries by a factor of 10,
while keeping the number of rows constant at 3000000, gave accurate statistics
with only 10 quantiles.
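Each data point in those experiments amounts to re-running something along
these lines with a different target (75 here is just one example value):
ALTER TABLE fact0 ALTER d0key SET STATISTICS 75;
ANALYZE fact0;
SELECT n_distinct, most_common_freqs
FROM pg_stats
WHERE tablename = 'fact0' AND attname = 'd0key';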
Given that the distribution used for fact0 is relatively benign (uniform),
the test is hopefully fair (i.e. it is not constructed specially to fox the
analyzer). However, the most common value for fact0 is non-unique (since all
d0key values have the same frequency) - I am uncertain if this is
significant...
Tests were performed with the 7.2 snapshot of 16 Oct.
regards
Mark