Re: Cross-column statistics revisited

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-15 16:13:34
Message-ID: e7e0a2570810150913x16c4fa4eic06a17829def5968@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
>
>> I've been interested in what it would take to start tracking
>> cross-column statistics. A review of the mailing lists as linked from
>> the TODO item on the subject [1] suggests the following concerns:
>>
>> 1) What information exactly would be tracked?
>> 2) How would it be kept from exploding in size?
>> 3) For which combinations of columns would statistics be kept?
>
> I think then you have
>
> 4) How would we form estimates from these stats

That too, but see below.

>> The major concern in #1 seemed to be that the most suitable form for
>> keeping most common value lists, histograms, etc. is in an array, and
>> at the time of the posts I read, arrays of composite types weren't
>> possible. This seems much less of a concern now -- perhaps in greatest
>> part because a test I just did against a recent 8.4devel sure makes it
>> look like stats on composite type columns aren't even kept. The most
>> straightforward is that we'd keep a simple multi-dimensional
>> histogram, but that leads to a discussion of #2.
>
> "multi-dimensional histogram" isn't such a simple concept, at least not to me.
>
> Histograms aren't a bar chart of equal widths and various heights like I was
> taught in school. They're actually bars of various widths arranged such that
> they all of the same heights.

That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists. Semantics aside, it looks like pgsql's stats stuff
(which I realize I'll need to get to know lots better to undertake
something like this) uses a list of quantiles, which still only makes
sense, as I see it, when a distance measurement is available for the
data type. The multidimensional case might do better with a frequency
count, where we'd store a matrix/cube/etc. with uniformly-sized units,
probably compressed via some regression function. That said, a
multidimensional quantile list would also be possible, compressed
similarly.

Provided a frequency chart or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.

> It's not clear how to extend that concept into two dimensions. I imagine
> there's research on this though. What do the GIST statistics functions store?

No idea.

- Josh / eggyknap

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-10-15 16:17:35 Re: Column level triggers
Previous Message Tom Lane 2008-10-15 16:03:34 Re: Annoying error messages in _dosmaperr