Quick Links

Re: n_distinct off by a factor of 1000

From:	Michael Lewis <mlewis(at)entrata(dot)com>
To:	"Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
Cc:	Klaudie Willis <Klaudie(dot)Willis(at)protonmail(dot)com>, PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject:	Re: n_distinct off by a factor of 1000
Date:	2020-06-24 22:27:35
Message-ID:	CAHOFxGpv06xOQqxXcfCrb-A5LgwOyBaeHuZvVP3CEvk8L=MsTQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:

> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: If
> a few values are very common and most very rare you are probably also
> much more likely to use the common values in a query: And for those you
> you would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.

This would only be true for values that are "common" but not in the MCVs
list, right?

If we could increase the sampling ratio beyond the hard coded 300x to get a
more representative sample and use that to estimate ndistinct (and also the
frequency of the most common values) but only actually stored the 100 MCVs
(or whatever the stats target is set to for the system or column) then the
issue may be mitigated without increasing planning time because of stats
that are larger than prudent, and the "only" cost should be longer
processing time when (auto) analyzing... plus overhead for considering this
potential new setting in all analyze cases I suppose.

In response to

Re: n_distinct off by a factor of 1000 at 2020-06-24 20:35:19 from Peter J. Holzer

Responses

Re: n_distinct off by a factor of 1000 at 2020-06-25 09:39:00 from Klaudie Willis
Re: n_distinct off by a factor of 1000 at 2020-06-28 21:55:50 from Peter J. Holzer

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tim Cross	2020-06-24 23:27:49	Re: Persistent Connections
Previous Message	Peter J. Holzer	2020-06-24 20:35:19	Re: n_distinct off by a factor of 1000