Re: General data warehousing questions

From: Shane Ambler <pgsql(at)Sheeky(dot)Biz>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Sean Davis <sdavis2(at)mail(dot)nih(dot)gov>, pgsql <pgsql-general(at)postgresql(dot)org>
Subject: Re: General data warehousing questions
Date: 2008-10-06 15:39:57
Message-ID: 48EA314D.80609@Sheeky.Biz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Scott Marlowe wrote:
> On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2(at)mail(dot)nih(dot)gov> wrote:
>> I am looking at the prospect of building a data warehouse of genomic
>> sequence data. The machine that produces the data adds about
>> 300million rows per month in a central fact table and we will
>> generally want the data to be "online". We don't need instantaneous
>> queries, but we would be using the data for data mining purposes and
>> running some "real-time" queries for reporting and research purposes.
>> I have had the pleasure of working on an Netezza box where this type
>> of thing is quite standard, but we don't have that access anymore, so
>> I'm looking for hints on using postgres in a data warehousing/mining
>> environment. Any suggestions on how DDL, loading, backup, indexing,
>> or (to a certain extent) hardware?
>
> I assume you're familiar with stuff like star schemas.
>
> For loading you might want to look at things like pg_bulkloader, copy,
>
> For indexing remember that you have partial and fuctional indexes in
> postgresql and they can come in quite handy.
>
> For backup of large changing databases look into PITR.
>
> As for hardware, you need enough CPU horsepower and memory to handle
> however many users you're gonna have running simultaneous queries, but
> more important is usually the drive subsystem. Throwing drives,
> battery backed cache and a good RAID controller can make a big
> difference. Usual RAID-10 is preferred, as writes are much faster. If
> you're really squeezed for space and money then you can use RAID-5 but
> it has some seriously negative performance implications for parallel
> load handling and write speed.
>

You may also be interested in looking at BioPostgres
http://phenomics.cs.ucla.edu/

--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2008-10-06 15:52:19 Re: Re: [Pkg-postgresql-public] Postgres major version support policy on Debian
Previous Message Markus Wanner 2008-10-06 15:34:13 Re: [Pkg-postgresql-public] Postgres major version support policy on Debian