Re: Postgres for a "data warehouse", 5-10 TB

From: Igor Chudov <ichudov(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres for a "data warehouse", 5-10 TB
Date: 2011-09-11 23:16:36
Message-ID: CAMhtkAazH6wCduE23307L+bxLZf9Y+WyAoFTYGAvnZ+n4RkStQ@mail.gmail.com
Lists: pgsql-performance

On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> * Igor Chudov (ichudov(at)gmail(dot)com) wrote:
> > Right now I have a personal (one user) project to create a 5-10
> > Terabyte data warehouse. The largest table will consume the most space
> > and will take, perhaps, 200,000,000 rows.
>
> I run data-warehouse databases on that order (current largest single
> instance is ~4TB running under 9.0.4). If the largest table is only
> 200M rows, PG should handle that quite well. Our data is partitioned by
> month and each month is about 200M records and simple queries can run in
> 15-20 minutes (with a single thread), with complex windowing queries
> (split up and run in parallel) finishing in a couple of hours.
>
>

Which brings up a question.

Can I partition data by month (or quarter) without that month being part of the
PRIMARY KEY?

If this question sounds weird, I am asking because MySQL requires the
partitioning column to be part of the primary key, which does not fit my data.

If I can keep the primary key as the ID that comes with the data, and still
partition the table by month, I will be EXTREMELY happy.
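
For concreteness, roughly what I have in mind, using the inheritance-style
partitioning PG supports today (the table and column names below are made up,
not my real schema, so treat this as a sketch):

CREATE TABLE facts (
    id         bigint NOT NULL,   -- the ID that comes with the data
    created_on date   NOT NULL,   -- the column I want to partition by
    payload    text
);

-- One child table per month, with a CHECK constraint on the month so
-- constraint_exclusion can skip partitions a query cannot touch.
CREATE TABLE facts_2011_09 (
    CHECK (created_on >= DATE '2011-09-01' AND created_on < DATE '2011-10-01')
) INHERITS (facts);

-- The primary key stays on id alone; with inheritance each child enforces
-- uniqueness only within itself, not across all partitions.
ALTER TABLE facts_2011_09 ADD PRIMARY KEY (id);

-- Ask the planner to use the CHECK constraints for partition pruning.
SET constraint_exclusion = partition;

The point being that nothing here forces created_on into the primary key the
way MySQL does.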

> However, while an hour is fine, two weeks per query is NOT fine.
>
> What's really, really, really useful are two things: EXPLAIN, and this
> mailing list. :) Seriously, run EXPLAIN on your queries before you run
> them and see if how the query is going to be executed makes sense.
> Here's a real easy hint: if it says "External Sort" and has big numbers,
> come talk to us here- that's about one of the worst things you can
> possibly do. Of course, PG's going to avoid doing that, but you may
> have written a query (unintentionally) which forces PG to do a sort, or
> something else.
>
>
Very good, thanks.
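
I will get into the habit of checking plans first, along these lines (the
table and column names are placeholders, not my real schema):

-- Show the plan without running the query.
EXPLAIN
SELECT d.name, count(*)
FROM   big_table AS b
JOIN   small_dim AS d ON d.id = b.dim_id
GROUP  BY d.name;

-- EXPLAIN ANALYZE actually executes the query and reports, for each sort,
-- whether it fit in work_mem or spilled to disk
-- (look for "Sort Method: external merge  Disk: ...").
EXPLAIN ANALYZE
SELECT *
FROM   big_table
ORDER  BY created_on;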

> > I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> > CPU cores.
>
> If you partition up your data and don't mind things running in different
> transactions, you can definitely get a speed boost with PG by running
> things in parallel. PG will handle that very well, in fact, if two
> queries are running against the same table, PG will actually combine
> them and only actually read the data from disk once.
>
> > I cannot shell out $47,000 per CPU for Oracle for this project.
>
> The above data warehouse was migrated from an Oracle-based system. :)
>
>
I am wondering, why?

> > To be more specific, the batch queries that I would do, I hope,
> > would either use small JOINS of a small dataset to a large dataset, or
> > just SELECTS from one big table.
>
> Make sure that you set your 'work_mem' correctly- PG will use that to
> figure out if it can hash the small table (you want that to happen,
> trust me..). If you do end up having sorts, it'll also use the work_mem
> value to figure out how much memory to use for sorting.
>
>
Could I, say, set work_mem to 30 GB? (64-bit Linux)
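
For the big batch queries I was thinking of raising it only for the session
rather than server-wide, something like this (whether the server accepts a
value that large I would still have to test):

-- Raise work_mem only for the current session; it is allocated per sort/hash
-- operation, so a value this large set server-wide could add up quickly
-- across concurrent operations.
SET work_mem = '30GB';

-- ... run the batch query here ...

RESET work_mem;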

> > So... Can Postgres support a 5-10 TB database with the use pattern
> > stated above?
>
> Yes, certainly.
>
>
That's great to know.

> Thanks,
>
> Stephen
>
