Re: Thinking About Correlated Columns (again)

From: Craig James <cjames(at)emolecules(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Thinking About Correlated Columns (again)
Date: 2013-05-15 16:23:07
Message-ID: CAFwQ8rcSrwfJRgtP9GV_t-hKRR4=PmVeqCuGJ7wEnnO-9PD0wA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com>wrote:

> [Inefficient plans for correlated columns] has been a pain point for quite
> a while. While we've had several discussions in the area, it always seems
> to just kinda trail off and eventually vanish every time it comes up.
>
> ...
> Since there really is no fix for this aside from completely rewriting the
> query or horribly misusing CTEs (to force sequence scans instead of bloated
> nested loops, for example)...
> I'm worried that without an easy answer for cases like this, more DBAs
> will abuse optimization fences to get what they want and we'll end up in a
> scenario that's actually worse than query hints. Theoretically, query hints
> can be deprecated or have GUCs to remove their influence, but CTEs are
> forever, or until the next code refactor.
>
> I've seen conversations on this since at least 2005. There were even
> proposed patches every once in a while, but never any consensus. Anyone
> care to comment?
>

It's a very hard problem. There's no way you can keep statistics about all
possible correlations since the number of possibilities is O(N^2) with the
number of columns.

We have an external search engine (not part of Postgres) for a
domain-specific problem that discovers correlations dynamically and does
on-the-fly alterations to its search plan. Our queries are simple
conjunctions (A & B & C ...), which makes "planning" easy: you can do
indexes in any order you like. So, the system watches each index's
performance:

1. An index that's rarely rejecting anything gets demoted and eventually
is discarded.
2. An index that's rejecting lots of stuff gets promoted.

Rule 1: If two indexes are strongly correlated, which ever happens to be
first in the plan will do all the work, so the second one will rarely
reject anything. By demoting it to later in the plan, it allows more
selective indexes to be examined first. If an index ends up rejecting
almost nothing, it is discarded.

Rule 2: If an index rejects lots of stuff and you promote it to the front
of the plan, then "anti-correlated" indexes (those that reject different
things) tend to move to the front of the plan, resulting in very efficient
indexing.

Typically, our query plans start with roughly 20-100 indexes. The nature
of our domain-specific problem is that the indexes are highly correlated
(but it's impossible to predict in advance; each query causes unique
correlations.) Within a few thousand rows, it usually has dropped most of
them, and finishes the query with 5-10 indexes that are about 95% as
selective, but much faster, than the original plan.

But there are two caveats. First, our particular query is a simple
conjunction (A & B & C ...). It's easy to shuffle the index orders around.

Second, databases tend to be non-homogeneous. There are clusters of
similar rows. If you blindly optimize by going through a table in storage
order, you may optimize strongly for a highly similar group of rows, then
discover once you get out of that section of the database that your
"optimization" threw out a bunch of indexes that would be selective in
other sections of the database. To solve this, you have to add a mechanism
that examines the database rows in random order. This tends to minimize
the problem (although in theory it could still happen).

A more typical query (i.e. anything past a simple conjunction) would be
much more difficult for a dynamic optimizer.

And I can't see how you can expect a static planner (one that doesn't do
on-the-fly modifications to the plan) to find correlations.

Craig

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Shaun Thomas 2013-05-15 16:27:29 Re: Thinking About Correlated Columns (again)
Previous Message Heikki Linnakangas 2013-05-15 15:52:22 Re: Thinking About Correlated Columns (again)