From: | Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Bad n_distinct estimation; hacks suggested? |
Date: | 2005-04-28 15:21:36 |
Message-ID: | 1114701696.4270ff80d577c@webmail.telus.net |
Lists: | pgsql-hackers pgsql-performance |
Quoting Josh Berkus <josh(at)agliodbs(dot)com>:
> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying to extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accurate result than just
> > >taking the larger sample on its own; GIGO. Whether they are consistent
> > >with one another has no relationship to whether the larger sample
> > >correlates with the whole population. You can think of the tiny sample
> > >like "anecdotal" evidence for wonderdrugs.
>
> Actually, it's more to characterize how large of a sample we need. For
> example, if we sample 0.005 of disk pages, and get an estimate, and then
> sample another 0.005 of disk pages and get an estimate which is not even
> close to the first estimate, then we have an idea that this is a table
> which defies analysis based on small samples. Whereas if the two estimates
> are < 1.0 stdev apart, we can have good confidence that the table is easily
> estimated. Note that this doesn't require progressively larger samples;
> any two samples would work.
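
To make the setup in the quoted text concrete, here is a rough sketch of the two-sample check. It is only an illustration under stated assumptions: it samples rows rather than disk pages, it uses the raw distinct count seen in each sample as the "estimate", and the column contents and the helper name `distinct_counts` are made up for the example. It is not the estimator ANALYZE uses.

```python
import random

def distinct_counts(column, fraction=0.005, seed=0):
    """Draw two independent random samples of the given fraction and
    return the number of distinct values seen in each one."""
    rng = random.Random(seed)
    k = max(1, int(len(column) * fraction))
    first = rng.sample(column, k)
    second = rng.sample(column, k)
    return len(set(first)), len(set(second))

# Hypothetical column: 200,000 rows, 100,000 distinct values,
# each value appearing exactly twice.
column = [i // 2 for i in range(200_000)]

a, b = distinct_counts(column)
true_distinct = len(set(column))
print("sample 1:", a, "sample 2:", b, "whole column:", true_distinct)
```

On a column like this, each 0.5% sample holds 1,000 rows and sees almost no repeated values, so the two counts land very close to each other. Neither sample has seen more than about 1% of the distinct values in the column, so how well the two samples agree and how close either is to the table-wide figure are separate questions.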
We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider to have a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is, if the two samples are
a stddev apart, yes, you've demonstrated that the union
of the two samples has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.

Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.
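
A small sketch of the arithmetic behind this, assuming each piece is summarized by its size, mean and population variance (the helper names `summarize` and `combine` and the numbers are made up for illustration). The variance of the combined sample is the within-piece variance plus a term from the gap between the piece means, so once you have the whole sample, the pieces' summaries add nothing to it.

```python
from statistics import fmean, pvariance

def summarize(xs):
    """Size, mean and population variance of one piece."""
    return len(xs), fmean(xs), pvariance(xs)

def combine(a, b):
    """Summary of the union of two pieces, from their summaries alone."""
    n1, m1, v1 = a
    n2, m2, v2 = b
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    within = (n1 * v1 + n2 * v2) / n              # average spread inside the pieces
    between = n1 * n2 * (m1 - m2) ** 2 / n ** 2   # spread caused by the gap in means
    return n, mean, within + between

# Two hypothetical pieces whose means are far apart.
first = [1.0, 2.0, 3.0, 4.0]
second = [6.0, 7.0, 8.0, 9.0]

n, mean, var = combine(summarize(first), summarize(second))
whole_var = pvariance(first + second)
print(var, whole_var)
```

The combined variance matches the variance computed directly on the whole sample, and it exceeds the variance of either piece here because the means differ. Going the other way is not possible: the two stddevs alone, without the means and sizes, do not determine the stddev of the whole.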
--
"Dreams come true, not free." -- S.Sondheim, ITW