From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Martijn van Oosterhout <kleptog(at)svana(dot)org> |
Cc: | Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: proposal : cross-column stats |
Date: | 2010-12-12 14:43:00 |
Message-ID: | 4D04DF74.1010706@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.
+1
>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
> anything with a b-tree operator class. What you've coded seems rely
> on calculations on the values. Have you thought about how it could
> work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.
Yeah, and that's actually analogous to the example I used in my
presentation.
The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.
That's assuming the combination has any matches. It's possible that the
user chooses a postcode and city combination that doesn't exist, but
that's no different from a user doing "city = 'fsdfsdfsd'" on a single
column, returning no matches. We should assume that the combination
makes sense.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2010-12-12 15:43:16 | Re: function attributes |
Previous Message | Oleg Bartunov | 2010-12-12 14:32:01 | Re: Extensions, patch v16 |