From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Nathan Boley <npboley(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Multi-Dimensional Histograms |
Date: | 2009-06-30 02:22:15 |
Message-ID: | 603c8f070906291922s34f786c4k683860696354a456@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<npboley(at)gmail(dot)com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> David Fetter <david(at)fetter(dot)org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues? Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.
I'm finding myself unable to follow all the terminology on this thead.
What's dimension reduction? What's PCA?
Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:
colA = constA and colB op constB
...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans. In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA. It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA. Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.
...Robert
From | Date | Subject | |
---|---|---|---|
Next Message | KaiGai Kohei | 2009-06-30 03:23:44 | Re: [PATCH] [v8.5] Security checks on largeobjects |
Previous Message | Bruce YUAN | 2009-06-30 02:10:52 | Re: How to register my function into backend? |