A Guide to Constraint Exclusion (Partitioning)

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: bizgres-general(at)pgfoundry(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: A Guide to Constraint Exclusion (Partitioning)
Date: 2005-07-13 10:53:17
Message-ID: 1121251997.3970.237.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

A Guide to the Constraint Exclusion Feature
===========================================

Simon Riggs 2ndQuadrant simon(at)2ndquadrant(dot)com

INTRODUCTION

Constraint Exclusion (CE) is an optimizer patch submitted for PostgreSQL
8.1. CE aims to greatly improve the performance for certain types of
common queries against large tables. CE extends the logic originally
developed for Partial Indexes to allow the optimizer to avoid scanning
particular tables for certain queries. CE will exclude a table from the
query plan when the query's WHERE clause is proven not to "overlap" with
any of the Constraints defined on those tables.

No new syntax is required to take advantage of this feature, which is
designed to work in conjunction with existing Inheritance features. Some
syntax *is* required to declare Constraints, which are already
ANSI/ISO SQL:2003 compliant.

In other RDBMS these features are often referred to as Partitioning. The
PostgreSQL CE feature is much more widely applicable than that name
might suggest and will be of benefit to many users, not just for
Business Intelligence applications.

The aim of this guide is to help prospective users understand how the
feature will work, to allow improvement during the 8.1 beta test cycle.

All feedback is welcome.

TABLE PARTITIONING OVERVIEW

Let's look at a practical example of how this works and the performance
benefits it provides:

In many BI workloads there are a small number of very large tables,
often referred to as Fact tables. In order to improve load and query
performance against those tables it is often useful to split these
larger tables into pieces, known as partitions.

PostgreSQL 8.0 could support pseudo-partitioning using the Inheritance
feature. Here's a BI Fact table for a Retail DW as an example of that:

CREATE TABLE Sales_DateItemOutlet
( DateKey Integer
, OutletKey Integer
, ItemKey Integer
, SoldQty Integer
);

This table could be accessed using a query like
SELECT sum(soldqty) FROM Sales_DateItemOutlet
WHERE DateKey between 20050101 and 20050101

This can then be split up so that we have a number of sub-tables:

CREATE TABLE Sales_Jan_DateItemOutlet
() INHERITS (Sales_DateItemOutlet);

CREATE TABLE Sales_Feb_DateItemOutlet
() INHERITS (Sales_DateItemOutlet);

CREATE TABLE Sales_Mar_DateItemOutlet
() INHERITS (Sales_DateItemOutlet);

When we load this we put January's data in the Jan table etc, though put
no rows at all in the "parent" table.

Now, if we run our example query again
SELECT sum(soldqty) FROM Sales_DateItemOutlet
WHERE DateKey between 20050101 and 20050101
we find that the query will
Scan all rows in Sales_DateItemOutlet (which is empty)
Scan all rows in Sales_Jan_DateItemOutlet
Scan all rows in Sales_Feb_DateItemOutlet
Scan all rows in Sales_Mar_DateItemOutlet
and return the correct answer. But we know that the query did not really
need to have scanned the Feb and Mar tables, since these do not contain
any data that would satisfy the query.

(The full EXPLAIN output is not shown above, for clarity only.)

The new CE functionality aims to improve the performance of such
queries. Ideally we would like the query to
Scan all rows in Sales_Jan_DateItemOutlet
and ignore the parent table, and all other child tables.

To do allow this, we must provide more declarative information to allow
the optimizer to understand as much as we do. To do this, we add
constraints on to each table, so that it is clear what rows they can
contain.

CREATE TABLE Sales_Jan_DateItemOutlet
( CHECK (DateKey BETWEEN 20050101 AND 20050131) )
INHERITS (Sales_DateItemOutlet);

CREATE TABLE Sales_Feb_DateItemOutlet
( CHECK (DateKey BETWEEN 20050201 AND 20050229) )
INHERITS (Sales_DateItemOutlet);

CREATE TABLE Sales_Mar_DateItemOutlet
( CHECK (DateKey BETWEEN 20050301 AND 20050331) )
INHERITS (Sales_DateItemOutlet);

Now, when we execute our test query, the optimizer can understand that
the Feb and Mar tables can *never* hold rows that would match our query.
As a result it is able to provably exclude them from the query plan.

Thus, with Constraint Exclusion enabled our test query
SELECT sum(soldqty) FROM Sales_DateItemOutlet
WHERE DateKey between 20050101 and 20050101
will perform the following scans
Scan all rows in Sales_DateItemOutlet (which is empty)
Scan all rows in Sales_Jan_DateItemOutlet

Running an EXPLAIN or EXPLAIN ANALYZE will allow you to see which tables
have been included in the query. There is no explicit message to say
that a table has been excluded.

We aren't able to exclude the parent table from the above query because
no Constraint was defined upon it. Since, in our example, the parent is
empty there will be little effect on the query performance. It would be
a mistake to attempt to get around this by placing a Constraint on the
parent, since that would then automatically be created on all child
tables also. So we can never exclude the parent without automatically
excluding *all* of the children also.

Overall then, if the partitions are all the same size, we will have
reduced execution time of the query to around one-third of its previous
elapsed time. If we had 10 or 100 child tables, certain queries could be
10 or 100 times faster than without the CE feature.

In summary, the CE feature will be a huge performance gain for
qualifying queries against large tables in PostgreSQL databases. Since
no new syntax is required, existing PostgreSQL databases designed to
take advantage of inheritance will automatically benefit from the
performance enhancements.

CONSTRAINT EXCLUSION

If you wish to enable this feature you must set the server parameter
enable_constraint_exclusion = true
This is a user-settable parameter, so could be enabled for specific
users or queries.

Constraint exclusion is not enabled by default. This is because
PostgreSQL does not yet have full plan invalidation when DDL changes are
made. If a change were made to any of the tables, the query would not
automatically re-optimize, so it would be possible to get an incorrect
answer returned from a CE plan. This is not a fault of CE, but simply a
side-effect of the current lack of full plan invalidation in PostgreSQL.

CE will work only when the query contains a direct and simple
restriction of the table, such as
Attribute > Constant

If Operators are available for that datatype, CE will use
= < <= > >= <>
These operators must be defined as IMMUTABLE.

CE will work for most simple Constraints. Constraints are already
limited to IMMUTABLE predicates, such as Attribute > Constant.

Constraints like
Attribute BETWEEN Const1 and Const2
can be used to produce ranges of values, and would be generally
described as RANGE PARTITIONING.

Constraints like
Attribute IN (Const1, Const2, Const3 ...)
can be used to produce lists of values, and would generally be described
as LIST PARTITIONING.

All of these Constraint types could be mixed, to allow Constraints like
Attribute1 IN (Const1, Const2)
AND Attribute2 >= Const3 AND Attribute2 < Const4

Thus, the CE feature allows a very flexible partitioning scheme to be
developed that could mix LIST, RANGE style partitioning clauses.

Currently, there is no restriction that all constraints *must* be
mutually exclusive, nor even that the constraints may be similar on each
table. This can be useful for some designs where the inheritance
hierarchy is not "disjoint", as UML would term this situation.

CE does not prevent direct access to one of the child tables in an
inheritance hierarchy. In this case, no exclusion test would be
performed. Exclusion tests are performed *only* when the parent table in
an inheritance hierarchy is accessed. Exclusion tests are performed even
if the inheritance hierarchy is many levels deep (e.g. parent-child-
grandchild). CE also supports multiple inheritance.

PostgreSQL makes no restriction upon what indices are defined on the
various child tables. CE is not dependent upon the existence or absence
of indices, only upon the Query's WHERE clause and the Constraints
defined upon the tables.

CE can be very useful for large historical databases. As an example,
PostgreSQL would allow a historical data table split into sections like
this:
- Last 3 months: one table per week, each table defined with 3 indices
- 3-12 months: one table per month, no indices defined
- 12-36 months: tablespaces defined on hierarchical storage (near-line),
with no indices, so de-archived when required for use.

The CE feature would allow the optimizer to avoid wasteful de-archiving
of the older data, with an appropriate database/query design.

The current patch also includes a suite of 700 tests that successfully
exclude child tables in all common query types (whilst returning the
correct answer!). These include examples of List, Range and mixed
partitioning scenarios.

CURRENT RESTRICTIONS

It is not yet possible to specify that Constraints on child tables will
be mutually exclusive of each other. Currently, it would be up to the
designer to ensure that, if desired.

It is not yet possible to specify that an inheritance parent has no
rows, and, if so, should always be excluded from the query.

If a parent table has a Constraint defined upon it, then this will be
automatically copied to all child tables. Currently, there is no way to
tell which Constraints have been inherited from the parent, so exclusion
tests will be re-executed against all child tables. This will cause
additional optimization time.

Currently, all child tables will be considered. It may be possible in
the future to pre-sort the list of child tables, so that optimization
time can be reduced for parent tables with large numbers of partitions.

Currently, there is no index on the pg_inherits system table. As a
result, parents with more than 1000 child tables are likely to
experience longer than desirable planning times for their queries.

CE checks will not currently recognise STABLE functions within a query.
So WHERE clauses such as
DateKey > CURRENT DATE
will not cause exclusion because CURRENT DATE is a STABLE function.

CE checks are not made when the parent table is involved in a join.

Other existing restrictions on Inherited tables continue to apply.

Further enhancements to the CE feature can be expected in the future.

COMPARISONS WITH OTHER RDBMS

In brief:

Teradata and Oracle already have table Partitioning. Oracle's
partitioning allows either LIST or RANGE partitioning. The new
PostgreSQL CE feature is considerably more flexible than Oracle's
declarative Partitioning syntax, and is achieved without adding
non-standard SQL extensions.

Both Sybase and MS SQLServer 2005 have new Partitioning features
released this year. The Microsoft feature does allow joins between
tables to exclude partitions. However, it also requires that all
partitions should have identical indexes, which is a major blocking
factor to the use of very large tables.

DB2 has partitioning using UNION ALL views, which includes somewhat
similar functionality to the PostgreSQL CE feature. DB2 uses the
PARTITION keyword to signify a different type of feature, so do not be
confused that they do have this feature with declarative syntax.

=================================================================

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Huxton 2005-07-13 13:02:47 Re: A Guide to Constraint Exclusion (Partitioning)
Previous Message Simon Riggs 2005-07-13 08:29:53 Re: Vacuum summary?