Re: General data warehousing questions

From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Sean Davis" <sdavis2(at)mail(dot)nih(dot)gov>
Cc: pgsql <pgsql-general(at)postgresql(dot)org>
Subject: Re: General data warehousing questions
Date: 2008-10-06 02:07:00
Message-ID: dcc563d10810051907l370ddfa4p8d0658a7ef98136a@mail.gmail.com
Lists: pgsql-general

On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2(at)mail(dot)nih(dot)gov> wrote:
> I am looking at the prospect of building a data warehouse of genomic
> sequence data. The machine that produces the data adds about
> 300 million rows per month in a central fact table and we will
> generally want the data to be "online". We don't need instantaneous
> queries, but we would be using the data for data mining purposes and
> running some "real-time" queries for reporting and research purposes.
> I have had the pleasure of working on a Netezza box where this type
> of thing is quite standard, but we don't have that access anymore, so
> I'm looking for hints on using postgres in a data warehousing/mining
> environment. Any suggestions on DDL, loading, backup, indexing,
> or (to a certain extent) hardware?

I assume you're familiar with stuff like star schemas.

For loading you might want to look at things like pg_bulkload and COPY.
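
As a rough illustration of the COPY route, here is a minimal Python sketch
that sends a batch of rows through a single COPY instead of row-by-row
INSERTs. It assumes a psycopg2 cursor (its copy_expert(sql, file) method
streams a file-like object to the server); the table name "reads" and its
columns are made up for the example and are not from the original question.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf


def bulk_load(cursor, table, columns, rows):
    """Load all rows with one COPY statement.

    Assumption: `cursor` behaves like a psycopg2 cursor, i.e. it offers
    copy_expert(sql, file). Table and column names are interpolated
    directly, so they must come from trusted code, never from user input.
    """
    sql = "COPY {} ({}) FROM STDIN WITH CSV".format(table, ", ".join(columns))
    cursor.copy_expert(sql, rows_to_csv_buffer(rows))
    return sql


# Hypothetical usage with a real connection:
#   with conn.cursor() as cur:
#       bulk_load(cur, "reads", ["sample_id", "chrom", "pos"], batch)
#   conn.commit()
```

For monthly volumes in the hundreds of millions of rows you would probably
feed this in batches, or point COPY at files directly, rather than build one
giant buffer in memory.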

For indexing remember that you have partial and functional indexes in
postgresql and they can come in quite handy.
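
To make the two index types concrete, here is a small sketch. It uses
Python's built-in sqlite3 module purely as a stand-in so it runs without a
server (it needs an SQLite build recent enough to support indexes on
expressions); as far as I know PostgreSQL accepts the same two CREATE INDEX
statements. The "reads" table, its columns and the quality cutoff of 30 are
invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reads ("
    "sample_name TEXT, chrom TEXT, pos INTEGER, quality INTEGER)"
)

# Partial index: only rows matching the WHERE clause are indexed, so the
# index stays small if most queries only care about high-quality reads.
conn.execute(
    "CREATE INDEX reads_good_pos_idx ON reads (chrom, pos) "
    "WHERE quality >= 30"
)

# Functional (expression) index: indexes the result of an expression, here
# so that lookups on lower(sample_name) do not have to scan the whole table.
conn.execute(
    "CREATE INDEX reads_lower_sample_idx ON reads (lower(sample_name))"
)

# List the indexes that now exist on the table.
index_names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
)
print(index_names)
```

A query only benefits from either index when its WHERE clause matches the
indexed predicate or expression, so it is worth checking the plans of your
real reporting queries before settling on them.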

For backup of large changing databases look into PITR.

As for hardware, you need enough CPU horsepower and memory to handle
however many users you're gonna have running simultaneous queries, but
more important is usually the drive subsystem. Throwing more drives, a
battery-backed cache and a good RAID controller at the problem can make a
big difference. Usually RAID-10 is preferred, as writes are much faster. If
you're really squeezed for space and money then you can use RAID-5 but
it has some seriously negative performance implications for parallel
load handling and write speed.
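
To put rough numbers on the RAID-10 versus RAID-5 trade-off, here is a
back-of-the-envelope sketch. It uses the usual rule of thumb that a small
random write costs 2 physical I/Os on RAID-10 (one per mirror) and 4 on
RAID-5 (read data, read parity, write data, write parity). The figures of 8
drives and 150 IOPS per drive are assumptions for illustration, not
measurements, and the model ignores what a battery-backed cache absorbs.

```python
# Rule-of-thumb physical I/Os per small random logical write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4}


def random_write_iops(level, drives, iops_per_drive):
    """Estimate sustained random-write IOPS for the whole array."""
    return drives * iops_per_drive / WRITE_PENALTY[level]


def usable_drives(level, drives):
    """Number of drives' worth of capacity left after redundancy."""
    if level == "raid10":
        return drives // 2  # half the drives hold mirror copies
    if level == "raid5":
        return drives - 1  # one drive's worth of space holds parity
    raise ValueError("unknown RAID level: " + level)


if __name__ == "__main__":
    drives, iops = 8, 150  # assumed array size and per-drive IOPS
    for level in ("raid10", "raid5"):
        print(
            level,
            "write IOPS:", random_write_iops(level, drives, iops),
            "usable drives:", usable_drives(level, drives),
        )
```

Under these assumptions RAID-5 gives you 7 drives of space against 4 for
RAID-10, but only about half the random-write throughput, which is the
trade-off described above.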
