Re: Query optimizer 8.0.1 (and 8.0)

From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 16:50:08
Message-ID: 20050208165008.GA1768@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 17:45:23 -0500,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> However for discrete values like the "top ten most common values" and the
> "total number of distinct values" it's not so clear at all that you can
> extrapolate from a sample at all. And it's certainly not clear that a fixed
> size sample gives you at all the same confidence for a large population as it
> does for a small population.

If you were to keep a complete histogram for the sample points (which may
or may not be practical) you should be able to estimate the number of
distinct values under some assumptions. Such as, that all values outside
of the top N values have the same likelihood. I don't think this is
unreasonable.

>
> If you sampled across the country and found your sample of 600 people had 4
> different top choices for president, how do you extrapolate that to calculate
> the total number of top choices for president the 300 million+ people will
> have across the country? You could multiply by 500k but that's not going to be
> right. Hell you're probably better off just going with "4" but that's clearly
> not right either.

Well you can put some bounds on the number. Since no 5th candidate was
picked by any of the 600 people, you would expect that the number of
people prefering other candidates is on the order of 500000 or less, so
that the number of distinct values is going to be 500000 or less.

I think the histogram idea will work well for estimating the number of
rows with particular values, but if you are interested in the number
of unique values, you are going to have problems with some data sets.
(Ones with a few very common values and lots of extremely rare items.)
In this case there may be some way to use information from indexes on
the data to get better results.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ron Mayer 2005-02-08 17:44:07 correlation in pg_stats
Previous Message Jim Buttafuoco 2005-02-08 16:47:42 Re: float4 regression test failed on linux parisc