From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2010-12-28 06:39:47
Message-ID: 4D198633.8070406@agliodbs.com
Lists: pgsql-hackers
> The simple truth is
>
> 1) sampling-based estimators are a dead-end
While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument. It asks us
to accept that sampling-based estimates can never be made to work,
simply because you say so.
The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators. Further, 2000 was hardly the end of sampling-estimation
paper publication; there are later papers with newer ideas.
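For what it's worth, the paper's GEE ("Guaranteed-Error Estimator") is simple enough to sketch in a few lines. This is a toy illustration of the estimator's shape, not PostgreSQL code, and the variable names are mine:

```python
import math
import random
from collections import Counter

def gee_estimate(sample, table_rows):
    """GEE from Charikar et al. (2000):
    D_hat = sqrt(N / n) * f1 + sum over j >= 2 of f_j,
    where N = table size, n = sample size, and f_j is the number of
    distinct values occurring exactly j times in the sample."""
    n = len(sample)
    f = Counter(Counter(sample).values())   # f[j] = count of values seen j times
    return (math.sqrt(table_rows / n) * f.get(1, 0)
            + sum(c for j, c in f.items() if j >= 2))

# A 10,000-row "table" with ~1,000 distinct values, sampled at 5%:
random.seed(1)
table = [random.randrange(1000) for _ in range(10000)]
print(gee_estimate(random.sample(table, 500), len(table)))
```

The sqrt(N/n) multiplier on the singletons is exactly what makes small sample fractions shaky: at a 1% sample it blows up every once-seen value by a factor of ten.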
For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*. The accuracy statistics for block-based samples of 5% of
the table look quite good.
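To make the idea concrete, here is my own sketch of block-level sampling (ignoring the within-block-correlation corrections a real block-based estimator must apply):

```python
import random

def block_sample(table, n_blocks_wanted, rows_per_block=100):
    """Read whole blocks ("pages") instead of scattered rows: for the
    same number of page reads we get rows_per_block times as many
    sampled rows, at the cost of within-block correlation."""
    n_blocks = len(table) // rows_per_block
    chosen = random.sample(range(n_blocks), n_blocks_wanted)
    rows = []
    for b in chosen:
        rows.extend(table[b * rows_per_block:(b + 1) * rows_per_block])
    return rows

# 500 sampled rows cost only 5 block reads here, versus up to 500
# scattered page reads for a row-level sample of the same size.
random.seed(2)
sample = block_sample(list(range(10000)), 5)
print(len(sample))
```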
I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample. But there's a huge difference between 5%
or 10% and "a majority of the table".
Again, don't let this discourage you from attempting to write a
stream-based estimator. But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.
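For a sense of what such a proof-of-concept could look like: a K-Minimum-Values sketch is one of the simplest stream-based distinct-count estimators. This is my own illustration, not necessarily the estimator Tomas has in mind:

```python
import hashlib

def kmv_estimate(stream, k=256):
    """K-Minimum-Values sketch: hash each value to a uniform point in
    [0, 1) and keep the k smallest distinct hashes. If h_k is the k-th
    smallest, the distinct count is estimated as (k - 1) / h_k."""
    smallest = set()
    for v in stream:
        h = int(hashlib.sha1(str(v).encode()).hexdigest(), 16) / 2.0 ** 160
        smallest.add(h)
        if len(smallest) > k:
            smallest.remove(max(smallest))   # O(k); fine for a demo
    if len(smallest) < k:
        return len(smallest)                 # fewer than k distinct: exact
    return (k - 1) / max(smallest)

# One pass over a 100,000-element stream with 10,000 distinct values:
print(round(kmv_estimate(i % 10000 for i in range(100000))))
```

Unlike a sample, the sketch sees every row, so its error (roughly 1/sqrt(k)) is independent of table size; the cost is the full-scan pass, which is exactly the trade-off a head-to-head comparison would have to quantify.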
[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]
--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com