Re: n_distinct off by a factor of 1000

From: Michael Lewis <mlewis(at)entrata(dot)com>
To: Klaudie Willis <Klaudie(dot)Willis(at)protonmail(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: n_distinct off by a factor of 1000
Date: 2020-06-23 17:33:26
Message-ID: CAHOFxGoXQ9yCR3q=Lh2HGucJfFdgi=PUX+RAuakyiXdnjy6=XA@mail.gmail.com
Lists: pgsql-general

>
> > > On 23/06/2020 14:42, Klaudie Willis wrote:
> > >
> > > > I got my first hint of why this problem occurs when I looked at the
> > > > statistics. For the column in question, "instrument_ref", the
> > > > statistics claimed it to be:
> > > > The default_statistics_target=500, and analyze has been run.
> > > > select * from pg_stats where attname like 'instr%_ref'; -- Result: 40.000
> > > > select count(distinct instrumentid_ref) from bigtable; -- Result: 33 385 922 (!!)
> > > > That is an astonishing difference of almost a 1000X.
>

Try something like this to check how representative those "most common
values" are. If n_distinct is very low compared to reality, and the
fraction of the table that the "most common" values claim to cover is also
low, then you can get very bad estimates when querying for values that are
not in the MCV list: the planner assumes an even distribution across the
remaining values, and that may be much higher or lower than reality. That
is, if you have the default statistics target of 100, those MCVs cover 5%
of the table, and n_distinct is 500, then the other 400 values are assumed
to cover the remaining 95% of the table evenly, so each value is estimated
at .95/400 * reltuples rows. If the real count of distinct values is
40,000, the number of rows you actually get for each value in your IN
clause is far lower than that estimate.
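
To make that concrete, here is a rough back-of-the-envelope version of that
arithmetic. The row count and fractions below are made-up illustrative
numbers, not figures from your table:

-- Hypothetical: 10M rows, MCVs cover 5% of the table, the planner believes
-- n_distinct = 500 (100 of those in the MCV list), but the true distinct
-- count is 40,000.
SELECT (1 - 0.05) / (500 - 100) * 10000000 AS planner_rows_per_value,
       (1 - 0.05) / (40000 - 100) * 10000000 AS actual_rows_per_value;
-- ~23750 vs ~238 rows per value, i.e. roughly a 100x overestimate for
-- every non-MCV value in an IN list.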

Using a custom ndistinct will dramatically impact the estimates the planner
relies on when deciding between an index scan and a sequential scan. And
even if a custom ndistinct and the actual distinct count drift apart by 2x
or 10x as your data grows, that matters very little IMO compared to relying
on the sample taken by the (auto)analyze job, which has been off by a
factor of 100x or even 1000x as you have experienced.
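
If you do set one, it is just a per-column storage option. A sketch, using
the table and column names from your quoted message and purely illustrative
values:

ALTER TABLE bigtable ALTER COLUMN instrumentid_ref SET (n_distinct = -0.02);
-- A negative value means "this fraction of reltuples", so it scales as the
-- table grows; a positive value like 33000000 pins an absolute count.
ANALYZE bigtable;  -- the override only takes effect at the next analyze

An alternative worth trying first is simply raising the per-column sample
size, so (auto)analyze has a better chance of getting close on its own:

ALTER TABLE bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;
ANALYZE bigtable;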

SELECT
    (SELECT SUM(x) FROM UNNEST(most_common_freqs) x) AS frac_MCV,
    tablename,
    attname,
    inherited,
    null_frac,
    n_distinct,
    array_length(most_common_vals, 1) AS n_mcv,
    array_length(histogram_bounds, 1) AS n_hist,
    correlation,
    *
FROM pg_stats
WHERE schemaname = 'public'
    AND tablename = 'table'
    AND attname = 'column';
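
With your numbers, if frac_MCV comes back low while n_distinct sits near
40,000 against roughly 33 million real distinct values, that would confirm
the even-distribution assumption described above is what is wrecking the
estimates.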
