From: | Shane Ambler <pgsql(at)Sheeky(dot)Biz> |
---|---|
To: | Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> |
Cc: | Sean Davis <sdavis2(at)mail(dot)nih(dot)gov>, pgsql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: General data warehousing questions |
Date: | 2008-10-06 15:39:57 |
Message-ID: | 48EA314D.80609@Sheeky.Biz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Scott Marlowe wrote:
> On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2(at)mail(dot)nih(dot)gov> wrote:
>> I am looking at the prospect of building a data warehouse of genomic
>> sequence data. The machine that produces the data adds about
>> 300million rows per month in a central fact table and we will
>> generally want the data to be "online". We don't need instantaneous
>> queries, but we would be using the data for data mining purposes and
>> running some "real-time" queries for reporting and research purposes.
>> I have had the pleasure of working on an Netezza box where this type
>> of thing is quite standard, but we don't have that access anymore, so
>> I'm looking for hints on using postgres in a data warehousing/mining
>> environment. Any suggestions on how DDL, loading, backup, indexing,
>> or (to a certain extent) hardware?
>
> I assume you're familiar with stuff like star schemas.
>
> For loading you might want to look at things like pg_bulkloader, copy,
>
> For indexing remember that you have partial and fuctional indexes in
> postgresql and they can come in quite handy.
>
> For backup of large changing databases look into PITR.
>
> As for hardware, you need enough CPU horsepower and memory to handle
> however many users you're gonna have running simultaneous queries, but
> more important is usually the drive subsystem. Throwing drives,
> battery backed cache and a good RAID controller can make a big
> difference. Usual RAID-10 is preferred, as writes are much faster. If
> you're really squeezed for space and money then you can use RAID-5 but
> it has some seriously negative performance implications for parallel
> load handling and write speed.
>
You may also be interested in looking at BioPostgres
http://phenomics.cs.ucla.edu/
--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz
Get Sheeky @ http://Sheeky.Biz
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2008-10-06 15:52:19 | Re: Re: [Pkg-postgresql-public] Postgres major version support policy on Debian |
Previous Message | Markus Wanner | 2008-10-06 15:34:13 | Re: [Pkg-postgresql-public] Postgres major version support policy on Debian |