Re: proposal : cross-column stats

From: Florian Pflug <fgp(at)phlo(dot)org>
To: tv(at)fuzzy(dot)cz
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-21 15:24:39
Message-ID: 7E7AEE68-CF2C-40FE-B4C5-BB8374624F2E@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Dec21, 2010, at 15:51 , tv(at)fuzzy(dot)cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>>
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>>
>> ...
>>
>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>> = dist(A)*P(A=x)/(2*dist(A,B)) +
>> dist(B)*P(B=x)/(2*dist(A,B))
>> = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>>
>> That averaging steps add *no* further data-dependent weights.
>
> Sorry, by 'varying weights' I didn't mean that the weights are different
> for each value of A or B. What I meant is that they combine the values
> with different weights (just as you explained).

I'm still not sure we're on the same page here. The resulting formula
is *not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.

The resulting formula instead is an *unweighted* (weights 1) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
as well estimate P(A=x,B=y) with

P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it's *still* be the very same uniform bayesian approach, just no
longer symmetric in A and B. Which may easily be preferable if you
have reasons to believe that this estimate is more correct than the
one obtained by swapping A and B. The original paper doesn't deal with
that case simply because they don't mention how P(A=x) and P(B=y)
are obtained at all. The postgres estimator, on the other hand,
knows quite well how it derived P(A=x) and P(B=y) and may have much
higher confidence in one value than in the other.

Assume for example that you're preparing the statement

SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter "?". The estimate for P(B=1),
on the other hand, can use the histogram, and will thus very likely be
much more precise. The two estimates for P(A=?,B=1) in this case are

P(A=?,B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
P(B=1,A=?)*P(A=1) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also
the average, then.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2010-12-21 15:51:53 Re: Extensions, patch 22 (cleanup, review, cleanup)
Previous Message tv 2010-12-21 14:51:29 Re: proposal : cross-column stats