Thinking About Correlated Columns (again)

From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Thinking About Correlated Columns (again)
Date: 2013-05-15 15:31:46
Message-ID: 5193AA62.2040700@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi!

This has been a pain point for quite a while. While we've had several
discussions in the area, it always seems to just kinda trail off and
eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT *
FROM foo
WHERE col_a = 50
AND col_b = 50;

Index Scan using idx_foo_ab on foo (cost=0.00..6.35 rows=1 width=12)
(actual time=0.030..3.643 rows=1000 loops=1)
Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular
case doesn't suffer terribly from the mis-estimation, but others do.
Boy, do they ever.

Since there really is no fix for this aside from completely rewriting
the query or horribly misusing CTEs (to force sequence scans instead of
bloated nested loops, for example), I recently tried to use some
existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo (cost=0.00..9.37 rows=2 width=12)
(actual time=0.013..0.871 rows=1000 loops=1)
Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the
histogram is still the basis for the cost estimates. I'm curious what
benefit overriding the n_distinct pg_stats field actually gives us in
practice.

I'm worried that without an easy answer for cases like this, more DBAs
will abuse optimization fences to get what they want and we'll end up in
a scenario that's actually worse than query hints. Theoretically, query
hints can be deprecated or have GUCs to remove their influence, but CTEs
are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even
proposed patches every once in a while, but never any consensus. Anyone
care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas(at)optionshouse(dot)com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Heikki Linnakangas 2013-05-15 15:52:22 Re: Thinking About Correlated Columns (again)
Previous Message ktm@rice.edu 2013-05-15 13:52:28 Re: RT3.4 query needed a lot more tuning with 9.2 than it did with 8.1