Warehouse Schema

From: "Worky Workerson" <worky(dot)workerson(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Warehouse Schema
Date: 2006-05-24 18:50:19
Message-ID: ce4072df0605241150j6781291bjbd64b74655607e44@mail.gmail.com
Lists: pgsql-sql

I'm developing a schema for a large data warehouse (10 billion records) and
had a couple of questions about how to optimize it. My biggest question is
about the design of the fact table and how much I should actually normalize
the data.

The data is going to be keyed by IP address, stored as a non-unique IP4
type. Each record is also tagged with various attributes, such as a
category and a type. Assuming that category and type are VARCHAR, would
it make sense to normalize these out of the fact table into their respective
tables and key them by an INTEGER? I.e.

CREATE TABLE big_fact_table_A (
    identifier IP4,
    data1 BYTEA,
    data2 BYTEA,
    ...
    dataN BYTEA,
    category VARCHAR(16),
    type VARCHAR(16)
);

... vs ...

CREATE TABLE big_fact_table_B (
    identifier IP4,
    data1 BYTEA,
    data2 BYTEA,
    ...
    dataN BYTEA,
    category INTEGER REFERENCES categories (category_id),
    type INTEGER REFERENCES types (type_id)
);

I figure that the normalized fact table should be quicker, as the integer is
much smaller than the varchar. On query, however, the table will need to be
joined against the two other tables (categories, types), but I still figure
that this is a win because the other tables are fairly small and should stay
resident in memory. Is this reasoning valid?
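
To make the join concrete, here is a minimal sketch of what the two lookup
tables and a typical query against big_fact_table_B might look like. The
"name" columns, the SERIAL keys and the filter value are assumptions for
illustration; only category_id and type_id come from the table definition
above.

```sql
-- Hypothetical lookup tables; the "name" columns are assumed, not given above.
CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name        VARCHAR(16) NOT NULL UNIQUE
);

CREATE TABLE types (
    type_id SERIAL PRIMARY KEY,
    name    VARCHAR(16) NOT NULL UNIQUE
);

-- Read big_fact_table_B back in the shape of big_fact_table_A.
SELECT f.identifier,
       c.name AS category,
       t.name AS type
FROM big_fact_table_B AS f
JOIN categories AS c ON c.category_id = f.category
JOIN types      AS t ON t.type_id     = f.type
WHERE c.name = 'example-category';  -- placeholder value
```

An INTEGER is 4 bytes, while a VARCHAR costs a length header plus the bytes
of the string, so the saving per row depends on how long the category and
type names actually are.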

The downside to this (from my perspective) is that the data comes in the
form of big_fact_table_A and could be easily COPYed straight into the
table. With big_fact_table_B it looks like I will have to do the "unJOIN"
in a script. Also, I have separate installations of the warehouse (with
different data sources) and it will be difficult to share data between them
unless their categories/types tables are keyed with exactly the same integer
IDs which, as I don't directly control the other installations, is not
guaranteed.
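
One way the "unJOIN" might be done inside the database rather than in a
script is to COPY into a staging table and resolve the names there. This is
only a sketch: fact_staging, the file path and the "name" columns on
categories and types are hypothetical, and the elided data columns would be
listed the same way as data1.

```sql
-- Staging table in the incoming shape (hypothetical name; data2..dataN elided).
CREATE TABLE fact_staging (
    identifier IP4,
    data1      BYTEA,
    category   VARCHAR(16),
    type       VARCHAR(16)
);

COPY fact_staging FROM '/path/to/incoming.dat';  -- hypothetical path

-- Register any category or type names that have not been seen before.
INSERT INTO categories (name)
SELECT DISTINCT s.category
FROM fact_staging AS s
WHERE s.category IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM categories AS c WHERE c.name = s.category);

INSERT INTO types (name)
SELECT DISTINCT s.type
FROM fact_staging AS s
WHERE s.type IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM types AS t WHERE t.name = s.type);

-- Resolve names to the local integer keys while loading the fact table.
INSERT INTO big_fact_table_B (identifier, data1, category, type)
SELECT s.identifier, s.data1, c.category_id, t.type_id
FROM fact_staging AS s
LEFT JOIN categories AS c ON c.name = s.category
LEFT JOIN types      AS t ON t.name = s.type;

TRUNCATE fact_staging;
```

Because each installation maps names to its own keys at load time, data
exchanged in the big_fact_table_A shape (names, not IDs) would not need the
integer IDs to match between installations.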

Any suggestions to the above "problems"?
