From: Michael Dean <mdean(at)sourceview(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: josh(at)agliodbs(dot)com, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 20:39:32
Message-ID: 4480A204.3070407@sourceview.com
Lists: pgsql-hackers

Greg Stark wrote:
>Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
>>> Using a variety of synthetic and real-world data sets, we show that
>>> distinct sampling gives estimates for distinct values queries that
>>>are within 0%-10%, whereas previous methods were typically 50%-250% off,
>>>across the spectrum of data sets and queries studied.
>>>
>>>
>>Aha. It's a question of the level of error permissible. For our
>>estimates, being 100% off is actually OK. That's why I was looking at 5%
>>block sampling; it stays within the range of +/- 50% n-distinct in 95% of
>>cases.
>>
>>
>
>Well, let's see. Say for example we're scanning 50,000 records out of 1M
>records. Using the theorem from Charikar et al. quoted as Theorem 1 in section
>2, we get that there is at least a 5% chance of getting a result with a ratio
>error of at least sqrt(((1M-50k)/100k) * ln(20)), i.e. about 5.33.
>
>So no, even assuming you have a good unbiased sample, a 5% sample is only
>going to get you to a result within 19%-533% of the real values 95% of the
>time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
>some distributions it may be outside that range even more than 95% of the
>time.
>
>This is entirely unlike the histograms where we have a solid foundation for a
>positive result. We can guarantee that the result will be outside +/- x% *at
>most* 5% of the time. For most distributions it'll be outside that range even
>less.
>
>And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
>5% block sampling took just as long as reading all the blocks. Even if we
>figure out what's causing that (IMHO surprising) result and improve matters I
>would only expect it to be 3-4x faster than a full scan.
>
>http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php
>
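For concreteness, here is a rough numeric check of the bound quoted above (a
minimal sketch in Python, assuming Theorem 1 of Charikar et al. has the form
ratio error >= sqrt(((n - r) / (2r)) * ln(1/delta)) for a sample of r rows out
of n with failure probability delta; the variable names are illustrative, not
the paper's):

    import math

    n = 1_000_000   # total records in the table
    r = 50_000      # rows sampled (a 5% sample)
    delta = 0.05    # probability of exceeding the bound

    # Lower bound on the ratio error of any distinct-value estimator
    # built from an r-row sample (assumed Theorem 1 form, Charikar et al.)
    error = math.sqrt(((n - r) / (2 * r)) * math.log(1 / delta))
    print(f"ratio error >= {error:.2f}")                          # ~5.33
    print(f"95% range: {100 / error:.0f}%-{100 * error:.0f}%")    # 19%-533%

That is where the 19%-533% figure comes from: 1/5.33 is roughly 0.19, so the
estimate can be off by a factor of 5.33 in either direction at least 5% of the
time.
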
I'm sorry to interrupt your esoteric (to me) discussion, but I have a
very simple question: would you define a "good unbiased sample"? My
statistics professor Dan Price (rest his soul) would tell me there are
only random samples of some sort, and "other" samples, which are always
biased and never good. Excuse my absolutes; I didn't mean Good or Evil.
And how do you calculate a level of error without randomization? And are
you talking about type I or type II error?
Michael