Re: cross column correlation revisted

From:	Joshua Tolley <eggyknap(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: cross column correlation revisted
Date:	2010-07-14 13:54:57
Message-ID:	4c3dc1b4.0350e70a.18fa.39d4@mx.google.com
Lists:	pgsql-hackers

On Wed, Jul 14, 2010 at 01:21:19PM +0200, Yeb Havinga wrote:
> Heikki Linnakangas wrote:
>> However, the problem is how to represent and store the
>> cross-correlation. For fields with low cardinality, like "gender" and
>> boolean "breast-cancer-or-not" you can count the prevalence of all the
>> different combinations, but that doesn't scale. Another often cited
>> example is zip code + street address. There's clearly a strong
>> correlation between them, but how do you represent that?
>>
>> For scalar values we currently store a histogram. I suppose we could
>> create a 2D histogram for two columns, but that doesn't actually help
>> with the zip code + street address problem.
> In my head the neuron for 'principal component analysis' went on while
> reading this. Back in college it was used to prepare input data before
> feeding it into a neural network. Maybe ideas from PCA could be helpful?

I've been playing off and on with an idea along these lines, which builds an
empirical copula[1] to represent correlations between columns where there
exists a multi-column index containing those columns. This copula gets stored
in pg_statistic. There are plenty of unresolved questions (and a crash I
introduced and haven't had time to track down), but the code I've been working
on is here[2] in the multicolstat branch. Most of the changes are in
analyze.c; no user-visible changes have been introduced. For that matter,
nothing yet actually uses the values once they are calculated (more
unresolved questions get in the way there), but it's a start.
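
For illustration only (this is not the code from the multicolstat branch, and
nothing here reflects how pg_statistic would actually store the data): a
minimal Python sketch of a binned empirical copula for two columns. The names
build_copula, copula_cdf and estimate_selectivity, the bins parameter, and the
uniform-within-cell interpolation are all assumptions made for the example. It
rank-transforms a sample of (a, b) pairs, records the joint frequency of the
ranks in a small grid, and uses that grid to estimate the selectivity of
"a <= x AND b <= y".

```python
from bisect import bisect_right


def build_copula(rows, bins=4):
    """Summarize a sample of (a, b) pairs as a bins x bins grid of rank frequencies."""
    n = len(rows)
    sorted_a = sorted(a for a, _ in rows)
    sorted_b = sorted(b for _, b in rows)
    counts = [[0] * bins for _ in range(bins)]
    for a, b in rows:
        # Rank in 1..n: number of sampled values <= this value.
        rank_a = bisect_right(sorted_a, a)
        rank_b = bisect_right(sorted_b, b)
        # Cell index for rank/n in (0, 1], computed in integer arithmetic.
        i = (rank_a * bins - 1) // n
        j = (rank_b * bins - 1) // n
        counts[i][j] += 1
    grid = [[c / n for c in row] for row in counts]
    return sorted_a, sorted_b, grid


def copula_cdf(grid, u, v):
    """Approximate C(u, v), assuming mass is spread uniformly inside each cell."""
    bins = len(grid)
    total = 0.0
    for i in range(bins):
        # Fraction of cell i (first axis) that lies within [0, u].
        frac_u = min(max(u * bins - i, 0.0), 1.0)
        for j in range(bins):
            frac_v = min(max(v * bins - j, 0.0), 1.0)
            total += grid[i][j] * frac_u * frac_v
    return total


def estimate_selectivity(stats, x, y):
    """Estimate the fraction of rows with a <= x AND b <= y."""
    sorted_a, sorted_b, grid = stats
    n = len(sorted_a)
    u = bisect_right(sorted_a, x) / n  # marginal CDF of column a at x
    v = bisect_right(sorted_b, y) / n  # marginal CDF of column b at y
    return copula_cdf(grid, u, v)
```

With perfectly correlated columns, e.g. rows = [(k, 10 * k) for k in range(100)],
estimate_selectivity(build_copula(rows), 49, 490) gives 0.5, which matches the
true fraction, whereas multiplying the two marginal selectivities under an
independence assumption would give 0.25.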

[1] http://en.wikipedia.org/wiki/Copula_(statistics)
[2] http://git.postgresql.org/gitweb?p=users/eggyknap/postgres.git

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

In response to

Re: cross column correlation revisted at 2010-07-14 11:21:19 from Yeb Havinga