From: | "Greg Stark" <stark(at)enterprisedb(dot)com> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: benchmarking the query planner |
Date: | 2008-12-12 18:33:12 |
Message-ID: | 4136ffa0812121033o105f8d6bie0f2a87082c77df7@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.
Not sure if you meant me, but I can say the approach I saw documented
before involved keeping a hash of all values seen in a full table
scan. When the hash reached a maximum size you discard half the hash
key space and from then on ignore any values which hash into that
space. When you reach the end you have a hash table with a random
sample of 1/2^n distinct values (where n is the number of times the
hash table overflowed) and the counts for those values.
If you're just interested in counting distinct values in the sample I
suppose you could do it using a Bloom filter just as we've talked
about for other hash tables. You scan through the list and increment
an ndistinct counter for every value you find where the bits aren't
all already set (then set those bits). That would undercount slightly
and I'm not sure how much faster than qsort it would actually be. The
log(n) factor in qsort's average case is pretty small when we're
talking about small samples.
--
greg
From | Date | Subject | |
---|---|---|---|
Next Message | Jeff Davis | 2008-12-12 18:39:20 | Re: Sync Rep: First Thoughts on Code |
Previous Message | Simon Riggs | 2008-12-12 18:31:57 | Re: benchmarking the query planner |