From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-21 11:28:10
Message-ID: AANLkTinpjTgLDv8R-AkeQu5vhzAx4EXpgQ1H01RBYFpv@mail.gmail.com
Lists: pgsql-hackers
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> I tried to pick up Robert's idea of quantifying "Implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close the
> (A,B) are to representing a function A -> B.
Actually Heikki's idea...
> Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
> estimates of dist(?) are consistent. From that you easily get
>
> dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
> dist(A,B)/dist(A) <= dist(B) <= dist(A,B)
>
> If dist(A) == dist(A,B), then there is a functional dependency
> A -> B, and conversely if dist(B) == dist(A,B) there is a functional
> dependency B -> A. Note that you can have both at the same time!
>
> On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
> smallest number of distinct values possible for a given combination
> of dist(A,B) and dist(A). This is the anti-function case.
>
> This motivates the definition
>
> F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]
>
> (You can probably drop the "-1"; it doesn't make much of a difference
> for larger values of dist(B).)
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "Implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A->B or B->A looks function-like
> enough to use the uniform bayesian approach. Or you might even go further,
> and decide *which* bayesian formula to use - the paper you cited always
> averages
>
> P(A=x|B=y)*P(B=y) and
> P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "We don't know
> which to pick".
Ideally you want to somehow make this a continuous transition between
the available formulas rather than a discrete transition, I think. If
F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
if it's 0, then it's P(A=x)*P(B=y). But suppose F(A,B)=0.5. Then
what? A naive approach would be to estimate P(A=x && B=y) = P(A=x) *
(1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
= 0.055. Of course I'm just hand-waving here, and this is without any
mathematical basis, being just the simplest formula I could think of
that gets the endpoints right and plots some sort of smooth curve
between them in the middle. A similar formula with a believable
argument to back it up seems like it would be a big step forward for
this method.
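To make the hand-waving concrete, here is the naive interpolation written out as a sketch (the function name is made up; this is just the formula from the paragraph above, checked at its endpoints):

```python
def blended_selectivity(p_a, p_b, f):
    """Naive blend between independence and full functional dependency:
    estimate P(A=x AND B=y) as P(A=x) * (1 - (1 - F(A,B)) * (1 - P(B=y))).
    At f=0 this reduces to p_a * p_b; at f=1 it reduces to p_a."""
    return p_a * (1 - (1 - f) * (1 - p_b))

print(blended_selectivity(0.1, 0.1, 0.0))  # ~0.01  (independence endpoint)
print(blended_selectivity(0.1, 0.1, 1.0))  # 0.1    (functional-dependency endpoint)
print(blended_selectivity(0.1, 0.1, 0.5))  # ~0.055 (the example in the text)
```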
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company