Quick Links

Re: multi-column index

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc:	David Brown <time(at)bigpond(dot)net(dot)au>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: multi-column index
Date:	2005-03-17 18:15:32
Message-ID:	25832.1111083332@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> On Wed, 16 Mar 2005 22:19:13 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> calculate the correlation explicitly for each index

> May be it's time to revisit an old proposal that has failed to catch
> anybody's attention during the 7.4 beta period:
> http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php

> I'm not sure I'd store index correlation in a separate table today.
> You've invented something better for functional index statistics, AFAIR.

Well, the original motivation for calculating correlations on columns
was that historically, you didn't need to re-ANALYZE after creating an
index: the stats on the base table were already in place. So the idea
was to have the correlations already available whether or not the index
existed. This works fine for plain indexes on single columns ;-). We
didn't realize (or at least I didn't) how poorly the per-column stats
apply to multi-column indexes.

I am coming around to the view that we really do need to calculate
index-specific correlation numbers, and that probably does need a
special table ... or maybe better, add a column to pg_index. The column
in pg_statistic is useless and should be removed, because there isn't
any need for per-column correlation.

Now, as to the actual mechanics of getting the numbers: the above link
seems to imply reading the whole index in index order. Which is a
hugely expensive proposition for a big index, especially one that's
grown rather than been built recently --- the physical and logical
orderings of the index will be different. (Hm, maybe we need a stat
about the extent of disorder within the index itself?) We need a way
to get the number from a small sample of pages.

The idea I was toying with was to recalculate the index keys for the
sample rows that ANALYZE already acquires, and then compare/sort
those. This is moderately expensive CPU-wise though, and it's also not
clear what "compare/sort" means for non-btree indexes.

If we could get a correlation estimate by probing only a small fraction
of the index pages, that would work, but in a disordered index I'm not
sure how you figure out what you're looking at.

regards, tom lane

In response to

Re: multi-column index at 2005-03-17 08:51:36 from Manfred Koizar

Responses

Re: multi-column index at 2005-03-18 10:42:23 from Manfred Koizar

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Michael Fuhr	2005-03-17 18:26:53	Re: cpu_tuple_cost
Previous Message	Tom Lane	2005-03-17 17:57:20	Re: cpu_tuple_cost