Re: n_distinct off by a factor of 1000

From: Klaudie Willis <Klaudie.Willis@protonmail.com>
To: Michael Lewis <mlewis@entrata.com>
Cc: "Peter J. Holzer" <hjp-pgsql@hjp.at>, PostgreSQL General <pgsql-general@lists.postgresql.org>
Subject: Re: n_distinct off by a factor of 1000
Date: 2020-06-25 09:39:00
Message-ID: exYVuFsfuM3l2SdUeIoTOCojYDWE4uqL_jlZBZo0MB5ij7XZ8hlaOVzNKHJyRhb26mwAIaNxil96YgzzRxd1Tps2T4YrJUHMOtGraAi_hW8=@protonmail.com
Lists: pgsql-general

> If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that to estimate ndistinct (and also the frequency of the most common values), but only actually store the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time from stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing... plus the overhead of considering this potential new setting in all analyze cases, I suppose.
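
(For context, the only knob available today scales the sample size and the number of MCVs kept together: ANALYZE samples roughly 300 * statistics_target rows. A minimal sketch, with placeholder table/column names:)

    -- Raise the per-column statistics target; ANALYZE then samples
    -- roughly 300 * 2000 = 600,000 rows instead of 300 * 100,
    -- but it also keeps up to 2000 MCVs/histogram buckets.
    -- "bridge" and "b" are placeholder names.
    ALTER TABLE bridge ALTER COLUMN b SET STATISTICS 2000;
    ANALYZE bridge;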

I found another large deviation in one of my bridge tables. It is an (int, int) table of 900M rows where the B column contains 2.7M distinct values; however, pg_stats claims it is only about 10,400. These numbers are with a statistics target of 500. I'm not sure it really matters for the planner for the queries I run, but it makes me a little nervous :)
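
The comparison can be reproduced with something like the sketch below (table/column names are placeholders, not my actual schema), and the estimate can be pinned manually as a workaround:

    -- Planner's estimate vs. the exact distinct count.
    SELECT n_distinct FROM pg_stats
    WHERE tablename = 'bridge' AND attname = 'b';

    SELECT count(DISTINCT b) FROM bridge;  -- exact, but slow on 900M rows

    -- Workaround: override the estimate so ANALYZE stops guessing.
    ALTER TABLE bridge ALTER COLUMN b SET (n_distinct = 2700000);
    ANALYZE bridge;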

Also, is it just my data samples, or is n_distinct underestimated, and by a larger ratio, far more often than it is overestimated?

K
