From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | gsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Bad n_distinct estimation; hacks suggested? |
Date: | 2005-04-24 18:30:50 |
Message-ID: | 200504241130.50218.josh@agliodbs.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers pgsql-performance |
Andrew,
> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable) with a sample size of 25% or more (which
is infeasable in any large table). The formula does work for populations
where D/N is much lower, say 0.01. So overall it seems to only work for 1/4
of cases; those where n/N is large and D/N is low. And, annoyingly, that's
probably the population where accurate estimation is least crucial, as it
consists mostly of small tables.
I've just developed (not original, probably, but original to *me*) a formula
that works on populations where n/N is very small and D/N is moderate (i.e.
0.1 to 0.4):
N * (d/n)^(sqrt(N/n))
However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4)
populations, and only 3 of them to boot. I'd appreciate other people trying
it on their own data populations, particularly very different ones, like D/N
> 0.7 or D/N < 0.01.
Further, as Andrew points out we presumably do page sampling rather than
purely random sampling so I should probably read the paper he referenced.
Working on it now ....
--
Josh Berkus
Aglio Database Solutions
San Francisco
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2005-04-24 18:41:41 | Old-style OR indexscan slated for destruction |
Previous Message | Joshua D. Drake | 2005-04-24 18:01:28 | Re: Constant WAL replay |
From | Date | Subject | |
---|---|---|---|
Next Message | Josh Berkus | 2005-04-24 19:08:15 | Re: [HACKERS] Bad n_distinct estimation; hacks suggested? |
Previous Message | Andrew Dunstan | 2005-04-24 17:58:10 | Re: [HACKERS] Bad n_distinct estimation; hacks suggested? |