partitioning versus clustering

From: "Andrew W(dot) Gibbs" <awgibbs(at)awgibbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: partitioning versus clustering
Date: 2012-11-18 17:14:02
Message-ID: 20121118171402.GA30233@raptor.commandosoftware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Postgres Performance Wizards,

I am a few years into developing and operating a system underpinned by
Postgres that sees the arrival a significant number of events around
the clock, not an epic amount of data, but enough to be challenging to
me, in particular when system downtime is not very palatable and the
data is retained quasi-indefinitely.

I have various tables that house different kinds of events, and in
addition to wanting to look at a small number of rows of data, users
often want to generate summary reports on large swaths of data that
span days or weeks. At present, these reports can trigger index scans
that take minutes to service, and the parameters of the reports are
user specified, making their pre-generation infeasible. Generally the
rows in these tables are write-once, but they contain a pointer to the
related BLOB from which they were constructed, and every now and again
some new field in the originating BLOB becomes of interest, causing me
to alter the table and then do a sweep of the full table with
corresponding updates, violating the otherwise INSERT-only nature.

These event tables generally have an "event time" column that is
indexed and which is an obvious candidate for either partitioning or
clustering of the table. I'm trying to make sense of which is the
better option for me.

As best I can tell, the decision points are as follows...

PARTITIONING

Pros:

* no outage; data just starts flowing into new partitions seamlessly
* allows more control over where the data goes, creating retrieval parallelization opportunities
* "clustering" cannot be inadvertently undone in a way that requires scheduled downtime to repair
* probably more resilient in the case of the "event time" being different from the time that I processed the event

Cons:

* does not deal with legacy data without extra migration (over time this becomes less relevant)
* requires some kind of background process to manage partition creation
* partition size will affect performance and choosing its size is not a science

CLUSTERING

Pros:

* no particularly custom development work on my part
* once done, it puts all existing data in a good state for efficient querying without extra work

Cons:

* will lock up the system for the duration of the CLUSTER command
* somehow need to make sure that ANALYZE commands run often enough
* does not give me much control of the underlying storage layout
* may have problems when the occasional mass-UPDATE is done
* unclear whether a VACUUM FULL is required to prevent subsequent un-clustered-ness despite having a fill factor of 100, stemming from the mass-UPDATE operations
* could generate a huge number of WAL segments to archive
* could possibly be sabotaged by the "event time" property not being well correlated with the time that the event is processed in the face of upstream systems have momentary issues

As far as questions to the group go:

* Is my understanding of the pros and cons of the options reasonably correct and comprehensive?
* What has governed your decisions in making such a choice on past projects of your own?
* If I go the clustering route, will the occasional mass update really mess with things, requiring a re-cluster and possibly even a full vacuum (to prevent re-un-clustering)?
* Might it make more sense to cluster when the "event time" property is the time that I processed the event but partition when it is the time that the event occurred in some other system?
* Is running a CLUSTER command actually necessary to get the performance benefits if the table ought already be in a good order, or is just running a CLUSTER command on a well ordered table enough to get query execution to yield nice sequential access to the disk?

Many thanks in advance for your insights...

-- AWG

Browse pgsql-performance by date

  From Date Subject
Next Message Pavel Stehule 2012-11-20 07:27:36 Re: Query that uses lots of memory in PostgreSQL 9.2.1 in Windows 7
Previous Message Julien Cigar 2012-11-16 16:53:26 Re: PostgreSQL strange query plan for my query