Re: proposal : cross-column stats

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-12 15:51:36
Message-ID: 4D04EF88.1010200@fuzzy.cz
Lists: pgsql-hackers

Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
> anything with a b-tree operator class. What you've coded seems to rely
> on calculations on the values. Have you thought about how it could
> work for, for example, strings?

Yes, I know - I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column), and that only requires that the values can be sorted - no
calculations on the values are needed. So it should work with strings.

And another thing I somehow forgot is handling the case when there is
no histogram, just an MCV list. That's mostly the same - each of the
values may form a separate group, or the values may be merged to form
fewer groups, etc.
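
Just to illustrate the idea, here is a minimal Python sketch (not the
actual implementation - the binning strategy and the sample data are
invented for the example). It builds the contingency table using only
the sort order of the values, so strings work just like integers; with
an MCV list instead of a histogram, each MCV value would simply become
its own group:

    from bisect import bisect_left
    from collections import Counter

    def equal_frequency_bounds(values, nbins):
        # Upper bounds of nbins equal-frequency groups - needs nothing
        # but sorting, no arithmetic on the values themselves.
        s = sorted(values)
        step = len(s) / nbins
        return [s[min(int(step * (i + 1)) - 1, len(s) - 1)]
                for i in range(nbins)]

    def contingency_table(rows, nbins):
        # Count the rows falling into each (group_a, group_b) cell.
        bounds_a = equal_frequency_bounds([a for a, _ in rows], nbins)
        bounds_b = equal_frequency_bounds([b for _, b in rows], nbins)
        table = Counter()
        for a, b in rows:
            table[(bisect_left(bounds_a, a),
                   bisect_left(bounds_b, b))] += 1
        return table

    # Works for strings too, e.g. (city, ZIP code) pairs:
    rows = [("Brno", "602 00"), ("Praha", "110 00"),
            ("Praha", "120 00"), ("Brno", "603 00"),
            ("Plzen", "301 00"), ("Praha", "130 00")]
    table = contingency_table(rows, 2)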

> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.
> Not that I suggest you fix this, but it's food for thought. Though
> strictly speaking this is a different kind of correlation than what
> you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get down to the level of a single cell of
the contingency table, there is no additional (more detailed)
information. So we're stuck with the multiplication estimate, or
something like that.
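
A quick illustration of that fallback, with invented numbers: say a
cell contains 1000 rows, and within it the two conditions individually
match 10% and 20% of the rows.

    cell_rows = 1000
    sel_a, sel_b = 0.10, 0.20
    # Independence is assumed within the cell, so the estimate is the
    # plain product:
    estimate = cell_rows * sel_a * sel_b   # -> 20 rows
    # If the columns are correlated inside the cell, the real number
    # may be anywhere between 0 and min(sel_a, sel_b) * cell_rows,
    # i.e. up to 100 rows here - and the stats give us no way to tell.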

I was actually thinking about this, and I think we could collect some
more info - a correlation coefficient for each bin, or something like
that. But that was not part of my proposal, and I'm not sure how to do
that.

>> 2) I really don't think we should collect stats for all combinations of
>> columns of a table - I do like the Oracle-like approach where a DBA
>> has to enable cross-column stats using an ALTER TABLE (for a
>> particular list of columns).
>>
>> The only exception might be columns from a multi-column index. It
>> might be quite efficient I guess?
>
> In the past it has been suggested to only do it for multi-column
> indexes, but these days I find that in some situations I prefer to
> make individual indexes and let the bitmap scan code combine them. So
> perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

>> 3) There are independence tests for contingency tables (e.g. Pearson's
>> Chi-squared test), so that it's easy to find out whether the columns
>> are independent. In that case we can just throw away these stats and
>> use the simple estimation.
>>
>> http://mathworld.wolfram.com/Chi-SquaredTest.html
>
> I think this would be good to include, if possible.
>
> Actually, I wonder if the existing stats collection code could be
> altered to attempt to calculate the correlation between columns as part
> of its other work.

I guess that would be rather expensive - to compute a correlation
coefficient you need two passes, and you need to do that for each pair
of columns. So I'd be surprised if it is feasible (and efficient).

Another thing is that a correlation coefficient can only be computed
for numeric columns, so it's not possible to do that for the city/ZIP
code case mentioned above. More precisely - it is possible (if you map
the strings to numbers somehow), but I doubt you'd get useful results,
as the assignment is rather arbitrary.

Well, you could ask the governments to assign ZIP codes to cities in
strictly alphabetical order, but I guess they'd say no.
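
By contrast, the chi-squared test from point (3) works directly on the
contingency table counts, so no mapping of strings to numbers is
needed. A minimal sketch (reusing the Counter-style table from the
sketch above):

    def chi_squared(table, nbins_a, nbins_b):
        # Pearson's chi-squared statistic for independence: compare
        # the observed cell counts against the counts expected if the
        # two columns were independent.
        total = sum(table.values())
        row = [sum(table.get((i, j), 0) for j in range(nbins_b))
               for i in range(nbins_a)]
        col = [sum(table.get((i, j), 0) for i in range(nbins_a))
               for j in range(nbins_b)]
        stat = 0.0
        for i in range(nbins_a):
            for j in range(nbins_b):
                expected = row[i] * col[j] / total
                if expected > 0:
                    observed = table.get((i, j), 0)
                    stat += (observed - expected) ** 2 / expected
        # Compare against the chi-squared distribution with
        # (nbins_a - 1) * (nbins_b - 1) degrees of freedom; if the
        # statistic is small, the columns look independent and the
        # cross-column stats can be thrown away.
        return stat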

>> 4) Or we could store just those cells where expected and observed values
>> differ significantly (may help if most of the values are independent,
>> but there's a small glitch somewhere).
>
> Compressing that grid would be useful, given that for many dimensions
> most of the grid will not be interesting. In fact, storing the 20
> largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather the values that differ
significantly from the expected ones. Generally there are two
interesting cases:

expected << observed - the optimizer may choose an index scan,
                       although a seq scan would be better

expected >> observed - the optimizer may choose a seq scan,
                       although an index scan would be better
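
A sketch of how such a compressed grid might be built, keeping only
the cells that are significantly off in either direction (the
threshold is made up, of course):

    def significant_cells(table, nbins_a, nbins_b, threshold=2.0):
        # Keep only the cells where the observed count differs from
        # the independence-based expectation by more than the
        # threshold, in either direction.
        total = sum(table.values())
        row = [sum(table.get((i, j), 0) for j in range(nbins_b))
               for i in range(nbins_a)]
        col = [sum(table.get((i, j), 0) for i in range(nbins_a))
               for j in range(nbins_b)]
        kept = {}
        for i in range(nbins_a):
            for j in range(nbins_b):
                observed = table.get((i, j), 0)
                expected = row[i] * col[j] / total
                if expected and (observed > threshold * expected
                                 or observed < expected / threshold):
                    kept[(i, j)] = observed
        return kept

The remaining cells would fall back to the multiplication estimate,
which is exactly what we'd use without any cross-column stats anyway.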

regards
Tomas
