Re: Data Warehouse Reevaluation - MySQL vs Postgres --

From: Joe Conway <mail(at)joeconway(dot)com>
To: simon(at)2ndquadrant(dot)com
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Data Warehouse Reevaluation - MySQL vs Postgres --
Date: 2004-09-15 15:15:50
Message-ID: 41485CA6.1080505@joeconway.com

simon(at)2ndquadrant(dot)com wrote:
> Joe Conway <mail(at)joeconway(dot)com> wrote on 15.09.2004, 06:30:24:
>>We're not completely done with our data conversion (from a commercial
>>RDBMSi), but so far the results have been excellent. Similar to what
>>others have said in this thread, the conversion involved restructuring
>>the data to better suit Postgres, and the application (data
>>analysis/mining vs. the source system which is operational). As a result
>>we've compressed a > 1TB database down to ~0.4TB, and seen at least one
>>typical query reduced from ~9 minutes down to ~40 seconds.
>
> Sounds interesting.
>
> The performance gain comes from partition elimination of the inherited
> tables under the root?
>
> I take it the compression comes from use of arrays, avoiding the need
> for additional rows and key overhead?

Sorry, in trying to be concise I was not very clear. I'm using the term
compression very generally here. I'll try to give a bit more background.
The original data source is a database schema designed for use by an
operational application that my company sells to provide enhanced
management of equipment that we also sell. The application needs to be
very flexible in exactly what data it stores in order to be useful
across a wide variety of equipment models and versions. In order to do
that there is a very large central "transaction" table that stores
name->value pairs in varchar columns. The name->value pairs come from
parsed output of the equipment, and as such there is a fair amount of
redundancy and unneeded data that ends up getting stored. At each
installation in the field this table can get very large (> billion
rows). Additionally the application prematerializes a variety of
summaries for use by the operators using the GUI.
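To illustrate the kind of generic name->value "transaction" table described
above, here is a minimal sketch; the table and column names are invented for
illustration and are not the actual schema:

```sql
-- Hypothetical sketch of a generic name->value transaction table.
-- Every attribute parsed from the equipment output lands here as text,
-- even values that are really integers or timestamps.
CREATE TABLE txn (
    txn_id       bigint,
    equipment_id varchar(32),
    recorded_at  timestamp,
    attr_name    varchar(64),    -- e.g. 'error_count'
    attr_value   varchar(255)    -- always varchar, regardless of real type
);
```

This shape is what makes the application flexible across equipment models,
and also what produces the redundancy and type overhead mentioned above.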

We collect the data exported from each of the systems in the field and
accumulate it in a single central database for data mining and analysis.
This is the database that is actually being converted. By compression I
really mean that unneeded and redundant data is being stripped out, and
data known to be of a certain datatype is stored in that type instead of
varchar (e.g. values known to be int are stored as int). Also the
summaries are not being converted (although we do some post processing
to create new materialized summaries).
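A rough sketch of what that conversion looks like, assuming a hypothetical
set of attributes (the column names here are illustrative only): attributes
known to be of a fixed type get pivoted out of the varchar name->value rows
into properly typed columns.

```sql
-- Hedged sketch: known attributes become typed columns, so values that
-- are really integers are stored as int rather than varchar. Redundant
-- and unneeded name->value rows are simply not carried over.
CREATE TABLE measurement (
    equipment_id varchar(32),
    recorded_at  timestamp,
    error_count  int,     -- previously a varchar attr_value row
    temperature  real     -- likewise
);
```

Typed storage is both smaller on disk and cheaper to compare and aggregate,
which accounts for much of the "compression" in the general sense used here.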

My points in telling this were:
- the use of inherited tables to partition this huge number of rows and
yet allow simple query access to it seems to work well, at least in
early validation tests
- had we simply taken the original database and "slammed" it into
Postgres with no further thought, we would not have seen the big
improvements, and thus the project might have been seen as a failure
(even though it saves substantial $)
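The inherited-table approach in the first point can be sketched roughly as
follows; the names and date ranges are invented, and this is only a minimal
outline of the technique, not the production schema:

```sql
-- One child table per date range, inheriting from a parent table.
-- Queries against the parent transparently include all children,
-- which is what allows simple query access across partitions.
CREATE TABLE measurement_2004q3 (
    CHECK (recorded_at >= '2004-07-01' AND recorded_at < '2004-10-01')
) INHERITS (measurement);

-- A query against the parent sees rows from every child:
SELECT count(*) FROM measurement
WHERE recorded_at >= '2004-09-01';
```

Loading each batch of field data into its own child table also keeps any one
table well below the billion-row scale of the original central table.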

Hope that's a bit more clear. I'm hoping to write up a more detailed
case study once we've cut the Postgres system into production and the
dust settles a bit.

Joe
