From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | tv(at)fuzzy(dot)cz |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: proposal : cross-column stats |
Date: | 2010-12-21 15:24:39 |
Message-ID: | 7E7AEE68-CF2C-40FE-B4C5-BB8374624F2E@phlo.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Dec21, 2010, at 15:51 , tv(at)fuzzy(dot)cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>>
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>>
>> ...
>>
>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>> = dist(A)*P(A=x)/(2*dist(A,B)) +
>> dist(B)*P(B=x)/(2*dist(A,B))
>> = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>>
>> That averaging steps add *no* further data-dependent weights.
>
> Sorry, by 'varying weights' I didn't mean that the weights are different
> for each value of A or B. What I meant is that they combine the values
> with different weights (just as you explained).
I'm still not sure we're on the same page here. The resulting formula
is *not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.
The resulting formula instead is an *unweighted* (weights 1) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
as well estimate P(A=x,B=y) with
P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)
and it's *still* be the very same uniform bayesian approach, just no
longer symmetric in A and B. Which may easily be preferable if you
have reasons to believe that this estimate is more correct than the
one obtained by swapping A and B. The original paper doesn't deal with
that case simply because they don't mention how P(A=x) and P(B=y)
are obtained at all. The postgres estimator, on the other hand,
knows quite well how it derived P(A=x) and P(B=y) and may have much
higher confidence in one value than in the other.
Assume for example that you're preparing the statement
SELECT * FROM T WHERE A = ? AND B = 1
We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter "?". The estimate for P(B=1),
on the other hand, can use the histogram, and will thus very likely be
much more precise. The two estimates for P(A=?,B=1) in this case are
P(A=?,B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
P(B=1,A=?)*P(A=1) = dist(A)*P(A=?)/dist(A,B).
There's a good chance that the former beats the latter, and thus also
the average, then.
best regards,
Florian Pflug
From | Date | Subject | |
---|---|---|---|
Next Message | Erik Rijkers | 2010-12-21 15:51:53 | Re: Extensions, patch 22 (cleanup, review, cleanup) |
Previous Message | tv | 2010-12-21 14:51:29 | Re: proposal : cross-column stats |