Re: proposal : cross-column stats

From: Florian Pflug <fgp(at)phlo(dot)org>
To: tv(at)fuzzy(dot)cz
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-18 15:59:11
Message-ID: B62CC7CD-1E8B-4A00-AC50-A56E1B214B7C@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Dec18, 2010, at 08:10 , tv(at)fuzzy(dot)cz wrote:

>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>>
>>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>
>>> where A and B are columns, dist(X) is number of distinct values in
>>> column X and sel(X) is selectivity of column X.
>>
>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>> and I cannot see how that'd work with the formula above.
>
> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
> conditional probability, as
>
> P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
>
> and "uniform correlation" assumption so that it's possible to replace the
> conditional probabilities with constants. And those constants are then
> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

Ok, I think I understand this now. The selectivity equation actually
*does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
for sel(A).

Take the clause "A = x AND B = y" for example. Without knowing anything
about x and y, reasonable guesses for sel(A=x) and sel(B=y) are

sel(A=x) = 1 / dist(A)
sel(B=y) = 1 / dist(B).

This is also what we do currently, according to var_eq_non_const() in
src/backend/utils/adt/selfuncs.c, if we don't have any additional
knowledge about x (Actually, we also factor the probability of A being
NULL into this).

With these estimates, your formula becomes

sel(A=x,B=y) = 1 / dist(A,B).

and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus

sel(A=x,B=y) = sel(A=x) * sel(B=y).

If, however, y is a constant, then we use the MKVs to estimate sel(B=y)
(var_eq_const() in src/backend/utils/adt/selfuncs.c). If

sel(B=y) ~= 0,

we'd currently also conclude that

sel(A=x,B=y) ~= 0.

With the "uniform correlation" approach, we'd instead estimate

sel(A=x,B=y) ~= sel(A=x) / dist(B)

assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
If dist(B) is small, this estimate is much worse than what we'd currently
get, since we've effectively ignored the information that the restriction
B=y alone guarantees that only very few rows will match.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2010-12-18 16:26:39 Re: Crash on attempt to connect to nonstarted server
Previous Message Itagaki Takahiro 2010-12-18 07:29:10 Re: Extensions, patch v19 (encoding brainfart fix) (was: Extensions, patch v18 (merge against master, bitrot-only-fixes))