n_distinct way off, but following a pattern.

From: "Nick Fankhauser" <nickf(at)ontko(dot)com>
To: "Pgsql-Performance(at)Postgresql(dot) Org" <pgsql-performance(at)postgresql(dot)org>
Cc: "Ray Ontko" <rayo(at)ontko(dot)com>
Subject: n_distinct way off, but following a pattern.
Date: 2003-11-14 21:27:16
Message-ID: NEBBLAAHGLEEPCGOBHDGEEFLJHAA.nickf@ontko.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi-

I'm seeing estimates for n_distinct that are way off for a large table
(8,700,000 rows). They get better by setting the stats target higher, but
are still off by a factor of 10 with the stats set to 1000. I've noticed and
reported a similar pattern before on another table. Because this follows the
same very consistent pattern, I figured it was worth reporting again. This
looks more like a bug than randomness. If the poor result was simply due to
having a small sample to work from, the estimates should be all over the
map, but these are consistently low, and vary in almost exact inverse
proportion to the stats target:

run 1: run2: run3:
n_distinct estimate, statistics = 10: 3168 3187 3212
n_distinct estimate, statistics = 100: 23828 24059 23615
n_distinct estimate, statistics = 1000: 194690 194516 194081
Actual distinct values: 3340724

Or to put it another way, if you were to take the estimate from analyze,
divide by the stats target and multiply by 10000, the result would be pretty
close to exact. (Within a factor of 2, which ought to be plenty close for
planning purposes.)

I'm running version 7.3.2

Any thoughts from folks familiar with this part of the source code?

Regards,
-Nick

PS-
Here's a log of the session that I got this from.

alpha=# select count(distinct actor_id) from actor_case_assignment;
-[ RECORD 1 ]--
count | 3340724
alpha=# analyze;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD
1 ]-----+-------------------------------------------------------------------
----------------------------------------------------------------------------
----------------
schemaname | public
tablename | actor_case_assignment
attname | actor_id
null_frac | 0
avg_width | 16
n_distinct | 3168
most_common_vals |
{18105XS,18115XS,18106XS,18113JD02,18115JD02,18106J27,18113XS,18113A10656,18
115LST,18108XS}
most_common_freqs |
{0.0206667,0.0206667,0.0196667,0.019,0.0176667,0.0173333,0.0163333,0.015,0.0
14,0.0136667}
histogram_bounds |
{18067A000-07P,18067PD397SC1574-1,18105LBPD,18106A2119-49,18106PD399IF845-1,
18108A03068-20,18108LECS207,18108PTW03737278-2,18111A19788-77,18115A50420,18
115XC}
correlation | 0.876795
alpha=#
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]
<Header snipped>
schemaname | public
tablename | actor_case_assignment
attname | actor_id
null_frac | 0
avg_width | 17
n_distinct | 23828
most_common_vals | {18115XS,18113JD02,18106XS,1....
<Rest of values snipped>
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-# WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-----
<Header snipped>
schemaname | public
tablename | actor_case_assignment
attname | actor_id
null_frac | 0
avg_width | 16
n_distinct | 194690
most_common_vals | {18106XS,18115XS,18115...
<Rest of values snipped>

alpha=# \x
Expanded display is off.
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment and attname='actor_id';
alpha'# ';
ERROR: parser: parse error at or near "actor_id" at character 85
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
3187
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
3212
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
24059
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
23615
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
194516
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
n_distinct
------------
194081
(1 row)

alpha=#

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Sullivan 2003-11-14 21:30:16 Re: Help with count(*)
Previous Message Tom Lane 2003-11-14 20:59:14 Re: constant vs function param differs in performance