From: | Ron Mayer <ron(at)intervideo(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Does "correlation" mislead the optimizer on large |
Date: | 2003-01-24 19:36:50 |
Message-ID: | Pine.LNX.4.44.0301241123310.986-100000@localhost.localdomain |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On Fri, 24 Jan 2003, Tom Lane wrote:
>
> Ron Mayer <ron(at)intervideo(dot)com> writes:
> > On a large tables, I think the "correlation" pg_stats field as calculated
> > by "vacuum analyze" or "analyze" can mislead the optimizer.
>
> If you look in the pghackers archives, you will find some discussion
> about changing the equation that cost_index() uses to estimate the
> impact of correlation on indexscan cost. The existing equation is
> ad-hoc and surely wrong, but so far no one's proposed a replacement
> that can be justified any better. If you've got such a replacement
> then we're all ears...
I've got a very slow one (full table scan perl script) that helps
my database... I don't know if it's a good general purpose solution.
That's why I'm asking if the concept is good here. :-)
> > In particular, if I have a large table t with columns 'a','b','c', etc,
> > and I cluster the table as follows:
> > create table t_ordered as select * from t order by a,b;
> > vacuum analyze t_ordered;
> > Column "b" will (correctly) get a very low "correlation" in
> > the pg_stats table -- but I think the optimizer would do better
> > assuming a high correlation because similar 'b' values are still
> > grouped closely on the same disk pages.
>
> How would that be? They'll be separated by the stride of 'a'.
In the case of date/time (for the queries I showed) the issue was
that 'a's were not at all unique so I had data like this:
dat | time | value
------------|----------|--------------------------------
2002-01-01 | 00:00:00 | whatever
2002-01-01 | 00:00:00 |
2002-01-01 | 00:00:00 |
2002-01-01 | 00:00:01 |
2002-01-01 | 00:00:01 | [many pages of 12am]
2002-01-01 | 00:00:01 |
2002-01-01 | 00:00:01 |
... thousands more rows....
2002-01-01 | 00:00:59 |
2002-01-01 | 00:01:00 | [many pages of 1am]
... tens of thousands of rows.
2002-01-01 | 23:59:59 |
2002-01-01 | 23:59:59 |
2002-01-01 | 23:59:59 | [many pages of 11pm]
2002-01-02 | 00:00:00 | [many *MORE* pages of 12am]
2002-01-02 | 00:00:00 |
2002-01-02 | 00:00:00 |
... tens of thousands of rows...
2002-01-02 | 23:59:59 | [many pages of 11pm]
2002-01-03 | 00:00:00 | [many *MORE* pages of 12am]
... millions more rows ...
A similar problem actually shows up again in the dimention tables
of my database; where I bulk load many pages at a time (which can
easily be ordered to give a good correlation for a single load) ...
but then the next week's data gets appended to the end.
id | value
------|----------------------------------
1 | aalok mehta [many pages of all 'a's]
2 | aamir khan
3 | aaron beall
| [...]
6234 | axel rose
6234 | austin wolf
6123 | barbara boxer [many pages of all 'b's]
| [...]
123456 | young
123457 | zebra
| [...data loaded later..]
123458 | aaron whatever [more pages of all 'a's]
123458 | aaron something else
123458 | aaron something else
| [...]
512344 | zelany
In this case I get many clustered blocks of "a" values, but these
clustered blocks happen at many different times across the table.
> It seems likely to me that a one-dimensional correlation statistic may
> be inadequate, but I haven't seen any proposals for better stats.
The idea is it walks the whole table and looks for more local
correlations and replaces the correlation value with a "good"
value if values "close" to each other on the disk are similar.
This way a single "correlation" value still works ... so I didn't
have to change the optimizer logic, just the "analyze" logic.
Basically if data within each block is highly correlated, it doesn't
matter as much (yeah, I now the issue about sequential reads vs. random
reads).
Ron
From | Date | Subject | |
---|---|---|---|
Next Message | Ron Mayer | 2003-01-24 20:04:12 | Re: Does "correlation" mislead the optimizer on large |
Previous Message | Josh Berkus | 2003-01-24 19:20:14 | Mount options for Ext3? |