From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | tv(at)fuzzy(dot)cz |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: proposal : cross-column stats |
Date: | 2010-12-21 14:27:04 |
Message-ID: | 1ABB35BF-A361-4535-A443-8968D2D60D1F@phlo.org |
Lists: | pgsql-hackers |
On Dec21, 2010, at 11:37 , tv(at)fuzzy(dot)cz wrote:
> I doubt there is a way to make this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>
> if [dist(A) == dist(A,B)] then [A => B]
>
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).
Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...
> This is the reason why they choose to always combine the values (with
> varying weights).
There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as
P(A=x,B=y) = P(B=y|A=x)*P(A=x) and then as
P(A=x,B=y) = P(A=x|B=y)*P(B=y).
Then they assume
P(B=y|A=x) ~= dist(A)/dist(A,B) and
P(A=x|B=y) ~= dist(B)/dist(A,B),
and go on to average the two different ways of computing P(A=x,B=y), which
finally gives
P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
= dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
= (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
That averaging step adds *no* further data-dependent weights.
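To make the formula concrete, here is a rough Python sketch of the averaged estimate. It is only an illustration under my own assumptions: the function names and the two toy data sets are made up for this mail, and the distinct counts are computed exactly from the rows rather than estimated, which is not what a real planner would do.

```python
from collections import Counter


def estimate_joint(p_a, p_b, dist_a, dist_b, dist_ab):
    """Averaged estimate: (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))."""
    return (dist_a * p_a + dist_b * p_b) / (2 * dist_ab)


def estimate_from_rows(rows, x, y):
    """Compute the estimate for P(A=x,B=y) from a list of (a, b) tuples.

    Hypothetical helper: uses exact frequencies and exact distinct counts.
    """
    n = len(rows)
    count_a = Counter(a for a, _ in rows)
    count_b = Counter(b for _, b in rows)
    dist_ab = len(set(rows))  # number of distinct (a, b) combinations
    return estimate_joint(
        count_a[x] / n, count_b[y] / n, len(count_a), len(count_b), dist_ab
    )


# Case 1: A => B. 100 "ZIP codes", each mapped to exactly one of 10 "cities".
dependent = [(z, z % 10) for z in range(100)]

# Case 2: A and B independent. Every combination of 10 x 10 values occurs once.
independent = [(a, b) for a in range(10) for b in range(10)]

est_dep = estimate_from_rows(dependent, 42, 2)   # true frequency is 1/100
est_ind = estimate_from_rows(independent, 4, 2)  # true frequency is 1/100

print(est_dep, est_ind)
```

In both toy cases the data is perfectly uniform, so the estimate coincides with the true frequency of the (x, y) pair. On skewed data I would not expect it to be exact.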
>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
>
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.
I haven't tried them, but I will. That Java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!
best regards,
Florian Pflug