Re: Warehouse Schema

From: Richard Huxton <dev(at)archonet(dot)com>
To: Worky Workerson <worky(dot)workerson(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Warehouse Schema
Date: 2006-05-25 09:54:44
Message-ID: 44757EE4.8040807@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Worky Workerson wrote:
> I'm developing a schema for a large data warehouse (10 billion records) and
> had a couple of questions about how to optimize it. My biggest question is
> on the design of the fact table and how much normalization I should
> actually
> do of the data.

As much as you can and still afford the hardware to support it.

> The data is going to be keyed by IP address, stored as a non-unique IP4
> type. Each record is also tagged with various attributes, such as a
> category and a type. Assuming that a category and type are VARCHAR, would
> it make sense to normalize these out of the fact table into their
> respective
> tables and key them by an INTEGER? I.e.
>
> CREATE TABLE big_fact_table_A (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category VARCHAR(16),
> type VARCHAR(16)
> );
>
> ... vs ...
>
> CREATE TABLE big_fact_table_B (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category INTEGER REFERENCES categories (category_id),
> type INTEGER REFERENCES types (type_id)
> );

Assuming categories and types are fixed sets this isn't really
normalisation, you're just substituting one representation with another.
That's assuming you enforce valid representations on table_A.

Of course, if the categories or types can change then you'll want to use
foreign-keys. Try int2 rather than int4 even (although check the
manuals for details to see if it'll pack them closer).

> I figure that the normalized fact table should be quicker, as the
> integer is
> much smaller than the varchar. On query, however, the table will need
> to be
> joined against the two other tables (categories, types), but I still figure
> that this is a win because the other tables are fairly small and should
> stay
> resident in memory. Is this reasoning valid?

That's the trade-off.

> The downside to this (from my perspective) is that the data comes in the
> form of big_fact_table_A and could be easily COPYed straight into the
> table. with big_fact_table_B it looks like I will have to do the "unJOIN"
> in a script. Also, I have separate installations of the warehouse (with
> different data sources) and it will be difficult to share data between them
> unless their categories/types tables are keyed with exactly the same
> integer
> IDs which, as I don't directly control the other installations, is not
> guaranteed.
>
> Any suggestions to the above "problems"?

If you're comfortable with a little "C" coding, write your own type.
That's really what you're trying to do here. Accept text input and
display text output but store in a compact form.

HTH
--
Richard Huxton
Archonet Ltd

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Alban Hertroys 2006-05-26 09:25:37 Re: [SQL] (Ab)Using schemas and inheritance
Previous Message Worky Workerson 2006-05-24 18:50:19 Warehouse Schema