Cross-field statistics

From: Decibel! <decibel(at)decibel(dot)org>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Cross-field statistics
Date: 2008-04-17 14:38:02
Message-ID: 14773AA0-AF89-4944-8EF6-53DDA524DB9F@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I just had an idea about how to create cross-field statistics, which
could greatly improve the quality of estimates involving multiple
conditions on one table. This is rather arm-wavy, but I wanted to at
least get the idea out...

If we built a table via

CREATE TABLE moo AS SELECT i, i*2 AS j FROM generate_series(1,9999) i;

Then it would be nice if the planner produced the same estimate for
all of these:

SELECT * FROM moo WHERE i>8888 AND j>8888*2;
SELECT * FROM moo WHERE i>8888 OR j>8888*2;
SELECT * FROM moo WHERE i>8888;
SELECT * FROM moo WHERE j>8888*2;

It only actually gets the last 2 correct (1117 rows, close enough to
the actual 1111 rows). On my laptop, it guesses 125 for the AND case
and 2109 for the OR case. The problem is that it doesn't know how
closely i and j are related. But in this (contrived) example, it
actually *could* make an inference between these two columns, because
each field has a correlation of 1. That means that you can actually
compute how much those two conditions will overlap by comparing how
much they overlap in the histogram that's stored in pg_stats. As a
first pass, it might be worth having the planner actually take this
simple case into account.

For all the other fields, what if ANALYZE constructed artificial
correlation orderings? We don't actually care about how well these
artificial correlations correspond to physical table ordering, we
only care about how many fields line up with a particular artificial
ordering. What I'm proposing is that once we have our set of sample
records in ANALYZE:

For each field that isn't already in a set of field groupings
* Sort sample rows on that field
* Calculate correlation for all other fields
* If there are other fields that have a correlation to this sort
order over some threshold, save them along with the field we
originally sorted on as a new 'field grouping'
* Else, there are no other fields that group with this field; it's
a "loner"

For each field grouping, at a minimum we'd need to store a histogram
for that grouping. It might be worth looking at how things change
when we sort on different fields in the grouping... the lower the
correlation threshold used to identify groupings, the more
variability there will be. I think we'd also want to consider how
well each field in the grouping correlated to that grouping. It might
also be worth iteratively dropping the correlation threshold and
searching again for groupings. At some point we lose the ability to
draw meaningful conclusion from the information, but I'd expect
there's some way we can calculate epsilon for different groupings and
take that into account with query plans.

The important thing is that this scheme adds less than O(n) (n being
the number of fields), and not O(n^2), both in terms of ANALYZE (ok,
maybe not entirely true since presumably we don't do any sorting
there right now) and in terms of storing statistics. I'm not sure
what it would do to the planner; the entire key there would be
identifying field groupings that covered sets of fields in the WHERE
clause.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-04-17 14:46:09 Re: Lessons from commit fest
Previous Message Tom Lane 2008-04-17 14:36:59 Re: get rid of psql welcome message