From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | tv(at)fuzzy(dot)cz |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: proposal : cross-column stats |
Date: | 2010-12-18 15:59:11 |
Message-ID: | B62CC7CD-1E8B-4A00-AC50-A56E1B214B7C@phlo.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Dec18, 2010, at 08:10 , tv(at)fuzzy(dot)cz wrote:
>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>>
>>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>
>>> where A and B are columns, dist(X) is number of distinct values in
>>> column X and sel(X) is selectivity of column X.
>>
>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>> and I cannot see how that'd work with the formula above.
>
> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
> conditional probability, as
>
> P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
>
> and "uniform correlation" assumption so that it's possible to replace the
> conditional probabilities with constants. And those constants are then
> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
Ok, I think I understand this now. The selectivity equation actually
*does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
for sel(A).
Take the clause "A = x AND B = y" for example. Without knowing anything
about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
sel(A=x) = 1 / dist(A)
sel(B=y) = 1 / dist(B).
This is also what we do currently, according to var_eq_non_const() in
src/backend/utils/adt/selfuncs.c, if we don't have any additional
knowledge about x (Actually, we also factor the probability of A being
NULL into this).
With these estimates, your formula becomes
sel(A=x,B=y) = 1 / dist(A,B).
and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
sel(A=x,B=y) = sel(A=x) * sel(B=y).
If, however, y is a constant, then we use the MKVs to estimate sel(B=y)
(var_eq_const() in src/backend/utils/adt/selfuncs.c). If
sel(B=y) ~= 0,
we'd currently also conclude that
sel(A=x,B=y) ~= 0.
With the "uniform correlation" approach, we'd instead estimate
sel(A=x,B=y) ~= sel(A=x) / dist(B)
assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
If dist(B) is small, this estimate is much worse than what we'd currently
get, since we've effectively ignored the information that the restriction
B=y alone guarantees that only very few rows will match.
best regards,
Florian Pflug
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2010-12-18 16:26:39 | Re: Crash on attempt to connect to nonstarted server |
Previous Message | Itagaki Takahiro | 2010-12-18 07:29:10 | Re: Extensions, patch v19 (encoding brainfart fix) (was: Extensions, patch v18 (merge against master, bitrot-only-fixes)) |