Re: proposal : cross-column stats

From: Florian Pflug <fgp(at)phlo(dot)org>
To: tv(at)fuzzy(dot)cz
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-21 14:27:04
Message-ID: 1ABB35BF-A361-4535-A443-8968D2D60D1F@phlo.org
Lists: pgsql-hackers

On Dec21, 2010, at 11:37 , tv(at)fuzzy(dot)cz wrote:
> I doubt there is a way to make this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>
> if [dist(A) == dist(A,B)] then [A => B]
>
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with
> varying weights).

There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as

P(A=x,B=y) = P(B=y|A=x)*P(A=x) and then as
P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume

P(B=y|A=x) ~= dist(A)/dist(A,B) and
P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y), which
finally gives

P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
= dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
= (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging step adds *no* further data-dependent weights.
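
To make the arithmetic concrete, here is a minimal Python sketch of the averaged estimate derived above. The function name, its parameters, and the toy numbers are assumptions made up for illustration; they are not taken from the paper or from any PostgreSQL code.

```python
def averaged_joint_estimate(p_a, p_b, dist_a, dist_b, dist_ab):
    """Estimate P(A=x,B=y) by averaging the two conditional expansions.

    p_a, p_b  -- single-column selectivities P(A=x) and P(B=y)
    dist_a    -- number of distinct values of A
    dist_b    -- number of distinct values of B
    dist_ab   -- number of distinct (A,B) combinations
    """
    if dist_ab <= 0:
        raise ValueError("dist_ab must be positive")
    # P(B=y|A=x)*P(A=x), with P(B=y|A=x) ~= dist(A)/dist(A,B)
    via_a = p_a * dist_a / dist_ab
    # P(A=x|B=y)*P(B=y), with P(A=x|B=y) ~= dist(B)/dist(A,B)
    via_b = p_b * dist_b / dist_ab
    # Fixed 1/2 weights: nothing here depends on the data.
    return (via_a + via_b) / 2


# Hypothetical uniform toy data: 1000 distinct A values, 100 distinct B values.
p_a, p_b = 1 / 1000, 1 / 100

# Independent columns: every combination occurs, dist(A,B) = dist(A)*dist(B).
independent = averaged_joint_estimate(p_a, p_b, 1000, 100, 1000 * 100)

# A determines B (A => B): dist(A,B) = dist(A).
dependent = averaged_joint_estimate(p_a, p_b, 1000, 100, 1000)

print(independent, dependent)
```

With these uniform toy numbers the independent case comes out at about P(A=x)*P(B=y), and the fully dependent case at about P(A=x), which are the values one would expect at those two extremes.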

>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
>
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.

I haven't tried them, but I will. That Java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug
