From: | tv(at)fuzzy(dot)cz |
---|---|
To: | "Florian Pflug" <fgp(at)phlo(dot)org> |
Cc: | tv(at)fuzzy(dot)cz, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: proposal : cross-column stats |
Date: | 2010-12-21 14:51:29 |
Message-ID: | 7482a3473efb8744e7e10b4dc9c47499.squirrel@sq.gransy.com |
Lists: | pgsql-hackers |
> On Dec21, 2010, at 11:37 , tv(at)fuzzy(dot)cz wrote:
>> I doubt there is a way to make this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>>    if [dist(A) == dist(A,B)] then [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
>> with exact values of dist(?)), and then about data errors (e.g. a city
>> matched to an incorrect ZIP code or something like that).
>
> Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
> kind of fragility without penalizing the non-correlated case...
Yes, I understand the intention, but I'm not sure how exactly you want
to use the F(?,?) function to compute P(A,B), which is the value
we're looking for.
If I understand it correctly, you proposed something like this:
   IF (F(A,B) > F(B,A)) THEN
      P(A,B) := c*P(A);
   ELSE
      P(A,B) := d*P(B);
   END IF;
or something like that (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This can
easily happen due to data errors or imprecise estimates.
And this actually matters, because P(A) and P(B) may differ
significantly. So the result would be very sensitive to slight
changes in the estimates.
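To make the concern concrete, here is a minimal Python sketch of the switching rule as I read it. The function name and all the numbers are made up for illustration, and c and d are only my guess at the intended coefficients, so treat it as a sketch rather than the actual proposal.

```python
def switch_estimate(f_ab, f_ba, p_a, p_b, dist_a, dist_b, dist_ab):
    """Pick one side only, depending on which F value is larger.

    c and d are the guessed coefficients dist(A)/dist(A,B) and
    dist(B)/dist(A,B); they are an assumption, not a confirmed part
    of the proposal.
    """
    c = dist_a / dist_ab
    d = dist_b / dist_ab
    if f_ab > f_ba:
        return c * p_a
    return d * p_b


# Hypothetical statistics, chosen only to show the effect.
p_a, p_b = 0.01, 0.05                  # P(A=x), P(B=y)
dist_a, dist_b, dist_ab = 1000, 50, 1200

# Two nearly identical pairs of F values, differing by 0.01 either way.
est_1 = switch_estimate(0.60, 0.59, p_a, p_b, dist_a, dist_b, dist_ab)
est_2 = switch_estimate(0.59, 0.60, p_a, p_b, dist_a, dist_b, dist_ab)

# The estimate jumps by the ratio (dist_a * p_a) / (dist_b * p_b).
print(est_1, est_2, est_1 / est_2)
```

With these made-up inputs a 0.01 change in F flips the branch and changes the estimate by a factor of about four, which is the kind of discontinuity I am worried about.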
>> This is the reason why they choose to always combine the values (with
>> varying weights).
>
> There are no varying weights involved there. What they do is to express
> P(A=x,B=y) once as
>
> ...
>
> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>             = dist(A)*P(A=x)/(2*dist(A,B)) +
>               dist(B)*P(B=y)/(2*dist(A,B))
>             = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>
> That averaging step adds *no* further data-dependent weights.
Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).
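For reference, the averaged formula quoted above can be written as a small Python sketch. The function names and the input numbers are hypothetical, chosen only to show that the two terms get different weights, dist(A)/(2*dist(A,B)) and dist(B)/(2*dist(A,B)), and that neither weight depends on the values x and y.

```python
def one_sided(p, dist_col, dist_ab):
    """One-sided estimate, e.g. P(B=y|A=x)*P(A=x) ~= dist(A)/dist(A,B) * P(A=x)."""
    return dist_col / dist_ab * p


def averaged_estimate(p_a, p_b, dist_a, dist_b, dist_ab):
    """(dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))."""
    return (dist_a * p_a + dist_b * p_b) / (2 * dist_ab)


# Hypothetical statistics, for illustration only.
p_a, p_b = 0.01, 0.05                  # P(A=x), P(B=y)
dist_a, dist_b, dist_ab = 1000, 50, 1200

avg = averaged_estimate(p_a, p_b, dist_a, dist_b, dist_ab)

# The weights applied to P(A=x) and P(B=y); they differ from each other
# but are the same for every pair of values (x, y).
weight_a = dist_a / (2 * dist_ab)
weight_b = dist_b / (2 * dist_ab)

# The average is just the mean of the two one-sided estimates.
mean_of_sides = (one_sided(p_a, dist_a, dist_ab) +
                 one_sided(p_b, dist_b, dist_ab)) / 2
print(avg, weight_a, weight_b, mean_of_sides)
```

Unlike a rule that switches between the two sides, this estimate does not depend on any comparison of F values, so small estimation errors cannot flip it from one branch to the other.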
regards
Tomas