Re: n_distinct off by a factor of 1000

From: Michael Lewis <mlewis(at)entrata(dot)com>
To: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
Cc: Klaudie Willis <Klaudie(dot)Willis(at)protonmail(dot)com>, PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: n_distinct off by a factor of 1000
Date: 2020-06-24 22:27:35
Message-ID: CAHOFxGpv06xOQqxXcfCrb-A5LgwOyBaeHuZvVP3CEvk8L=MsTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:

> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: If
> a few values are very common and most very rare you are probably also
> much more likely to use the common values in a query: And for those you
> you would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.

This would only be true for values that are "common" but not in the MCVs
list, right?

If we could increase the sampling ratio beyond the hard coded 300x to get a
more representative sample and use that to estimate ndistinct (and also the
frequency of the most common values) but only actually stored the 100 MCVs
(or whatever the stats target is set to for the system or column) then the
issue may be mitigated without increasing planning time because of stats
that are larger than prudent, and the "only" cost should be longer
processing time when (auto) analyzing... plus overhead for considering this
potential new setting in all analyze cases I suppose.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tim Cross 2020-06-24 23:27:49 Re: Persistent Connections
Previous Message Peter J. Holzer 2020-06-24 20:35:19 Re: n_distinct off by a factor of 1000