From: | Richard Huxton <dev(at)archonet(dot)com> |
---|---|
To: | Worky Workerson <worky(dot)workerson(at)gmail(dot)com> |
Cc: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: Warehouse Schema |
Date: | 2006-05-25 09:54:44 |
Message-ID: | 44757EE4.8040807@archonet.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
Worky Workerson wrote:
> I'm developing a schema for a large data warehouse (10 billion records) and
> had a couple of questions about how to optimize it. My biggest question is
> on the design of the fact table and how much normalization I should
> actually
> do of the data.
As much as you can and still afford the hardware to support it.
> The data is going to be keyed by IP address, stored as a non-unique IP4
> type. Each record is also tagged with various attributes, such as a
> category and a type. Assuming that a category and type are VARCHAR, would
> it make sense to normalize these out of the fact table into their
> respective
> tables and key them by an INTEGER? I.e.
>
> CREATE TABLE big_fact_table_A (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category VARCHAR(16),
> type VARCHAR(16)
> );
>
> ... vs ...
>
> CREATE TABLE big_fact_table_B (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category INTEGER REFERENCES categories (category_id),
> type INTEGER REFERENCES types (type_id)
> );
Assuming categories and types are fixed sets this isn't really
normalisation, you're just substituting one representation with another.
That's assuming you enforce valid representations on table_A.
Of course, if the categories or types can change then you'll want to use
foreign-keys. Try int2 rather than int4 even (although check the
manuals for details to see if it'll pack them closer).
> I figure that the normalized fact table should be quicker, as the
> integer is
> much smaller than the varchar. On query, however, the table will need
> to be
> joined against the two other tables (categories, types), but I still figure
> that this is a win because the other tables are fairly small and should
> stay
> resident in memory. Is this reasoning valid?
That's the trade-off.
> The downside to this (from my perspective) is that the data comes in the
> form of big_fact_table_A and could be easily COPYed straight into the
> table. with big_fact_table_B it looks like I will have to do the "unJOIN"
> in a script. Also, I have separate installations of the warehouse (with
> different data sources) and it will be difficult to share data between them
> unless their categories/types tables are keyed with exactly the same
> integer
> IDs which, as I don't directly control the other installations, is not
> guaranteed.
>
> Any suggestions to the above "problems"?
If you're comfortable with a little "C" coding, write your own type.
That's really what you're trying to do here. Accept text input and
display text output but store in a compact form.
HTH
--
Richard Huxton
Archonet Ltd
From | Date | Subject | |
---|---|---|---|
Next Message | Alban Hertroys | 2006-05-26 09:25:37 | Re: [SQL] (Ab)Using schemas and inheritance |
Previous Message | Worky Workerson | 2006-05-24 18:50:19 | Warehouse Schema |