Quick Links

Re: [HACKERS] Re: Top N queries and disbursion

From:	Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Roberto Cornacchia <rcorna(at)tin(dot)it>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Top N queries and disbursion
Date:	1999-10-07 23:53:17
Message-ID:	199910072353.TAA09358@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> No, it's certainly not the right thing. To my understanding, disbursion
> is a measure of the frequency of the most common value of an attribute;
> but that tells you very little about how many other values there are.
> 1/disbursion is a lower bound on the number of values, but it wouldn't
> be a good estimate unless you had reason to think that the values were
> pretty evenly distributed. There could be a *lot* of very-infrequent
> values.
>
> > with 100 distinct values of an attribute uniformly distribuited in a
> > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > us 440 distinct values.
>
> This is an illustration of the fact that Postgres' disbursion-estimator
> is pretty bad :-(. It usually underestimates the frequency of the most
> common value, unless the most common value is really frequent
> (probability > 0.2 or so). I've been trying to think of a more accurate
> way of figuring the statistic that wouldn't be unreasonably slow.
> Or, perhaps, we should forget all about disbursion and adopt some other
> statistic(s).

Yes, you have the crux of the issue. I wrote it because it was the best
thing I could think of, but it is non-optimimal. Because all the
optimal solutions seemed too slow to me, I couldn't think of a better
one.

Here is my narrative on it from vacuum.c:

---------------------------------------------------------------------------

* We compute the column min, max, null and non-null counts.
* Plus we attempt to find the count of the value that occurs most
* frequently in each column
* These figures are used to compute the selectivity of the column
*
* We use a three-bucked cache to get the most frequent item
* The 'guess' buckets count hits. A cache miss causes guess1
* to get the most hit 'guess' item in the most recent cycle, and
* the new item goes into guess2. Whenever the total count of hits
* of a 'guess' entry is larger than 'best', 'guess' becomes 'best'.
*
* This method works perfectly for columns with unique values, and columns
* with only two unique values, plus nulls.
*
* It becomes less perfect as the number of unique values increases and
* their distribution in the table becomes more random.

--
Bruce Momjian | http://www.op.net/~candle
maillist(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

In response to

Re: Top N queries and disbursion at 1999-10-07 23:16:58 from Tom Lane

Responses

Re: [HACKERS] Re: Top N queries and disbursion at 1999-10-08 13:18:42 from Roberto Cornacchia

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	1999-10-07 23:54:29	Re: [HACKERS] Re: psql and comments
Previous Message	Tom Lane	1999-10-07 23:28:15	Re: [HACKERS] Re: psql and comments