--- perform.sgml.orig Sat Feb 5 12:45:36 2005
+++ perform.sgml Tue Feb 8 17:15:48 2005
@@ -470,6 +470,286 @@
+
+
+ How the Planner Uses Statistics
+
+
+ statistics
+ of the planner
+
+
+
+ This section builds on the material covered in the previous two and
+ shows how the planner uses the system statistics to estimate the number of
+ rows each stage of a query might return. We will adopt the approach of
+ showing by example, which should provide a good feel for how this works.
+
+
+
+ Continuing with the examples drawn from the regression test
+ database (and 8.0 sources), let's start with a simple query which has
+ one restriction in its WHERE clause:
+
+
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
+
+ QUERY PLAN
+------------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244)
+ Filter: (unique1 < 1000)
+
+
+
+ The planner examines the WHERE clause condition:
+
+
+unique1 < 1000
+
+
+ and looks up the restriction function for the operator
+ < in pg_operator.
+ This is held in the column oprrest,
+ and the result in this case is scalarltsel.
+ The scalarltsel function retrieves the histogram for
+ unique1 from pg_statistics
+ - we can follow this by using the simpler pg_stats
+ view:
+
+
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='unique1';
+
+ histogram_bounds
+------------------------------------------------------
+ {1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995}
+
+
+ Next the fraction of the histogram occupied by < 1000
+ is worked out. This is the selectivity. The histogram divides the range
+ into equal frequency buckets, so all we have to do is locate the bucket
+ that our value is in and count part of it and
+ all of the ones before. The value 1000 is clearly in
+ the second (970 - 1943) bucket, so by assuming a linear distribution of
+ values inside each bucket we can calculate the selectivity as:
+
+
+selectivity = (1 + (1000 - 970)/(1943 - 970)) / 10
+ = 0.1031
+
+
+ that is, one whole bucket plus a linear fraction of the second, divided by
+ the number of buckets. The estimated number of rows can now be calculated as
+ the product of the selectivity and the cardinality of
+ tenk1:
+
+
+rows = 10000 * 0.1031
+ = 1031
+
+
+
+
+
+ Next let's consider an example with a WHERE clause using
+ the = operator:
+
+
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA';
+
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244)
+ Filter: (stringu1 = 'ATAAAA'::name)
+
+
+ Again the planner examines the WHERE clause condition:
+
+
+stringu1 = 'ATAAAA'
+
+
+ and looks up the restriction function for =, which is
+ eqsel. This case is a bit different, as the most
+ common values — MCVs, are used to determine the
+ selectivity. Let's have a look at these, with some extra columns that will
+ be useful later:
+
+
+SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+null_frac | 0
+n_distinct | 672
+most_common_vals | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA}
+most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}
+
+
+ The selectivity is merely the frequency corresponding to 'ATAAAA':
+
+
+selectivity = 0.003
+
+
+ The estimated number of rows is just the product of this with the
+ cardinality of tenk1 as before:
+
+
+rows = 10000 * 0.003
+ = 30
+
+
+ The number displayed by EXPLAIN is one more than this,
+ due to some post estimation checks.
+
+
+
+ Now consider the same query, but with a constant that is not in the
+ MCV list:
+
+
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
+
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244)
+ Filter: (stringu1 = 'xxx'::name)
+
+
+ This is quite a different problem, how to estimate the selectivity when the
+ value is not in the MCV list.
+ The approach is to use the fact that the value is not in the list,
+ combined with the knowledge of the frequencies for all of the
+ MCVs:
+
+
+selectivity = (1.0 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003
+ + 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667)) / (672 - 10)
+ = 0.001465
+
+
+ That is, add up all the frequencies for the MCVs and
+ subtract them from one — because it is not one
+ of these, and divide by the remaining distinct values.
+ Notice that there are no null values so we don't have to worry about those.
+ The estimated number of rows is calculated as usual:
+
+
+rows = 10000 * 0.001465
+ = 15
+
+
+
+
+
+ In the case where there is more than one condition in the
+ WHERE clause, for example:
+
+
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
+
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244)
+ Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))
+
+
+ then independence is assumed and the selectivities of the individual
+ restrictions are multiplied together:
+
+
+selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
+ = 0.1031 * 0.001465
+ = 0.00015104
+
+
+ The row estimates are calculated as before:
+
+
+rows = 10000 * 0.00015104
+ = 2
+
+
+
+
+ Let's examine a query that includes a JOIN :
+
+
+EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
+WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
+
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Nested Loop (cost=0.00..346.90 rows=51 width=488)
+ -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244)
+ Index Cond: (unique1 < 50)
+ -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244)
+ Index Cond: ("outer".unique2 = t2.unique2)
+
+
+ The restriction on tenk1
+ unique1 < 50
is evaluated before the nested-loop join.
+ This is handled analogously to the initial example. The restriction operator
+ for < is scalarlteqsel as before,
+ but this time the value 50 is in the first bucket of the
+ unique1 histogram:
+
+
+selectivity = ((50 - 1) / (970 - 1)) / 10
+ = 0.005057
+
+rows = 10000 * 0.005057
+ = 51
+
+
+ The restriction for the join is:
+
+
+t2.unique2 = t1.unique2
+
+
+ This is due to the join method being nested-loop, with
+ tenk1 being in the outer loop. The operator is just
+ our familiar =, however the restriction function is
+ obtained from the oprjoin column of
+ pg_operator - and is eqjoinsel.
+ Additionally we use the statistical information for both
+ tenk2 and tenk1:
+
+
+SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
+WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
+
+tablename | null_frac | n_distinct | most_common_vals
+-----------+-----------+------------+------------------
+ tenk1 | 0 | -1 |
+ tenk2 | 0 | -1 |
+
+
+ In this case there is no MCV information for
+ unique2 because all the values appear to be
+ unique, so we can use an algorithm that relies only on the number of
+ distinct values for both relations together with their null fractions:
+
+
+selectivity = (1 - 0) * (1 - 0) * min(1 / 10000, 1 / 1000)
+ = 0.0001
+
+
+ This is, subtract the null fraction from one for each of the relations,
+ and divide by the maximum of the two distinct values. The number of rows
+ that the join is likely to emit is calculated as the cardinality of
+ cartesian product of the two nodes in the nested-loop, multiplied by the
+ selectivity:
+
+
+rows = 51 * 10000 * 0.0001
+ = 51
+
+
+
+
+
Controlling the Planner with Explicit JOIN> Clauses