Re: benchmarking the query planner

From: "Greg Stark" <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 18:33:12
Message-ID: 4136ffa0812121033o105f8d6bie0f2a87082c77df7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.

Not sure if you meant me, but I can say the approach I saw documented
before involved keeping a hash of all values seen in a full table
scan. When the hash reached a maximum size you discard half the hash
key space and from then on ignore any values which hash into that
space. When you reach the end you have a hash table with a random
sample of 1/2^n distinct values (where n is the number of times the
hash table overflowed) and the counts for those values.

If you're just interested in counting distinct values in the sample I
suppose you could do it using a Bloom filter just as we've talked
about for other hash tables. You scan through the list and increment
an ndistinct counter for every value you find where the bits aren't
all already set (then set those bits). That would undercount slightly
and I'm not sure how much faster than qsort it would actually be. The
log(n) factor in qsort's average case is pretty small when we're
talking about small samples.

--
greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2008-12-12 18:39:20 Re: Sync Rep: First Thoughts on Code
Previous Message Simon Riggs 2008-12-12 18:31:57 Re: benchmarking the query planner