Proposed Query Planner TODO items

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposed Query Planner TODO items
Date: 2003-12-05 19:08:05
Message-ID: 200312051108.05868.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

PG Folks,

What follows are a couple of proposed TODO items to make up for some of the
places our planner is weak compared to other leading databases.
Particularly, I'm personally concerned that as of 7.4.0 we would "fail" the
TPC benchmark even if someone sponsored us for it (see Issue #2 below).

I freely admit that I don't have the skill to implement either of these;
instead, I want them on the TODO list just so we don't lose track of them,
and just in case some new brilliant coder jumps into our community looking
for something to do.

1) MAINTAIN CORROLARY STATS ON FORIEGN KEYS

Summary: Keep correspondance statistics between FK columns.

Description: One of the areas of ongoing planner estimation problems
estimation of cross-table correspondence of column values. Indeed, as late
a 7.2.4 the WHERE EXISTS code just estimated a flat 50%.
While it would be practically impossible to maintain statistics between all
columns in a database that might possibly be compared, there is one class of
cross-table column comparisons which is both used heavily and is readily
identifiable: foriegn keys.
My proposal is to keep statistics on the correlation of values between the
key and foriegn key values in order to arrive at better estimates. Adapting
the newly committed pg_indexstats to track this as well seems to me to be the
easiest method, but I'll admit to not really understanding Manfried's code.

NOTE: This suggestion was dicussed on Hackers early last summer and received
general approval but somehow never ended up on the TODO list.

2) DEVELOP BETTER PLANS FOR "OR GROUP" QUERIES

Summary: Currently, queries with complex "or group" criteria get devolved by
the planner into canonical and-or filters resulting in very poor execution on
large data sets. We should find better ways of dealing with these queries,
for example UNIONing.

Description: While helping OSDL with their derivative TPC-R benchmark, we ran
into a query (#19) which took several hours to complete on PostgreSQL. It
was in the general form:

SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.a
AND (
( t1.c = x
AND t1.f IN (m, n, o)
AND t2.d = v
AND t2.e BETWEEN j AND k
)
OR
( t1.c = y
AND t1.f IN (n, o, p)
AND t2.d = v
AND t2.e BETWEEN k AND h
)
OR
( t1.c = z
AND t1.f IN (p, q)
AND t2.d = w
AND t2.e BETWEEN k AND h
)
)

The reason why this query is included in the TPC benchmarks is the reason I've
run into problems with similar querys before; it is the kind of query
produced by many 3rd-party decision-support and reporting applications. Its
distinguishing feature is the same thing which gives PG indigestion; the
distinct OR groups with a complex set of criteria for each.

Or planner's approach to this sort of query is to devolve the criteria into a
3-page long set of canonical and-or filters, and seq scan the entire
underlying data set. This is fine if the data set is small, but if it's
several times the size of RAM, a full-table seq scan is fatal, as it is for
TPC-R which seems specifically designed to test for this kind of failure.

One solution which suggests itself is that the following query form runs in a
couple of seconds:

SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.a
AND t1.c = x
AND t1.f IN (m, n, o)
AND t2.d = v
AND t2.e BETWEEN j AND k
UNION ALL
SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.a
AND t1.c = y
AND t1.f IN (n, o, p)
AND t2.d = v
AND t2.e BETWEEN k AND h
UNION ALL
SELECT t1.a, t2.b
FROM t1, t2
AND t1.c = z
AND t1.f IN (p, q)
AND t2.d = w
AND t2.e BETWEEN k AND h

So the trick would be teaching the planner to:
a) recognize an "or group query" when it sees one;
b) break down that query into a multi-part union and estimate the cost

However, I'm sure there are other possible solutions. Oracle and MSSQL have
solved this particular query problem; anybody know how they do it?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2003-12-05 20:14:37 Re: Proposed Query Planner TODO items
Previous Message Joe Conway 2003-12-05 19:01:41 Re: request for feedback - read-only GUC variables,