Re: Postgres for a "data warehouse", 5-10 TB

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Igor Chudov <ichudov(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres for a "data warehouse", 5-10 TB
Date: 2011-09-11 14:16:27
Message-ID: 4E6CC2BB.1060704@squeakycode.net
Lists: pgsql-performance

On 09/11/2011 07:35 AM, Igor Chudov wrote:
> I have been a MySQL user for years, including owning a few
> multi-gigabyte databases for my websites, and using it to host
> algebra.com (about 12 GB database).
>
> I have had my ups and downs with MySQL. The ups were ease of use and
> decent performance for small databases such as algebra.com. The downs
> were things like twenty hour REPAIR TABLE operations on a 35 GB
> table, etc.
>
> Right now I have a personal (one user) project to create a 5-10
> Terabyte data warehouse. The largest table will consume the most space
> and will take, perhaps, 200,000,000 rows.
>
> I want to use it to obtain valuable business intelligence and to make
> money.
>
> I expect it to grow, never shrink, and to be accessed via batch
> queries. I do not care for batch queries to be super fast, for example
> an hour per query would be just fine.
>
> However, while an hour is fine, two weeks per query is NOT fine.
>
> I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> CPU cores.
>
> My initial plan was to use MySQL, InnoDB, and deal with problems as
> they arise. Perhaps, say, I would implement my own joining
> procedures.
>
> After reading some disparaging stuff about InnoDB performance on large
> datasets, however, I am getting cold feet. I have a general feeling
> that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
> with either MySQL and Postgres.
>
> I do not know much about Postgres, but I am very eager to learn and
> see if I can use it for my purposes more effectively than MySQL.
>
> I cannot shell out $47,000 per CPU for Oracle for this project.
>
> To be more specific, the batch queries that I would do, I hope,
> would either use small JOINS of a small dataset to a large dataset, or
> just SELECTS from one big table.
>
> So... Can Postgres support a 5-10 TB database with the use pattern
> stated above?
>
> Thanks!
>
> i
>

That is an order of magnitude or two larger than I have experience with. I converted my website database from mysql to PG; it has several databases between 1 and 10 GB. There are parts of the website that were faster with mysql, and parts that are faster with PG. In one spot, because PG has superior join support in SELECT statements, I was able to replace the several simpler statements mysql had to fire off with a single, more complicated statement. It's a search screen where you can type in fifteen-ish different options; I generate one statement that joins eight-odd tables with plenty of WHERE conditions, and PG runs it in the blink of an eye. It's astonishing compared to the pain of mysql. If you ever have to write your own join, or your own lookup function, that's a failure of your database.
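
To give a feel for it, here is a rough sketch of that kind of consolidated query. The table and column names are made up for illustration; the real schema is my site's and not shown here:

    -- One statement instead of several round trips from the application
    SELECT p.id, p.title, c.name AS category, a.name AS author
      FROM posts p
      JOIN categories c ON c.id = p.category_id
      JOIN authors   a ON a.id = p.author_id
     WHERE (p.title ILIKE '%search term%' OR p.body ILIKE '%search term%')
       AND c.name = 'howto'
       AND p.created_at >= '2011-01-01'
     ORDER BY p.created_at DESC
     LIMIT 50;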

One spot that was slower was a batch insert of data, though not so much slower that it was a problem. I use COPY on PG vs. prepared INSERTs on mysql. It was pretty close, but mysql still won.
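
For reference, the shape of the PG side, with a hypothetical table name (COPY reads a server-side file; from psql you can use \copy for a client-side file):

    -- Instead of thousands of round trips like:
    --   INSERT INTO measurements (ts, sensor_id, value) VALUES (...);
    -- stream the whole batch in one statement:
    COPY measurements (ts, sensor_id, value)
      FROM '/path/to/batch.csv' WITH (FORMAT csv);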

Seeing as you can set up and test both databases, have you considered a trial run?

Things to watch for:

I think the same amount of data will use more disk space in PG than in mysql.
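
You can measure that yourself once some data is loaded; these are built-in functions, and the table name is hypothetical:

    -- On-disk footprint of one table, indexes and TOAST included
    SELECT pg_size_pretty(pg_total_relation_size('measurements'));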

Importing data into PG should use COPY, ideally over multiple connections at once (see the sketch below).

PG will only use multiple cores if you use multiple connections (each connection uses one core).
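
A sketch of those two points together: split the input into chunks ahead of time and run one COPY per connection, so each session gets its own core. File and table names are made up:

    -- connection 1:
    COPY measurements FROM '/data/chunk_01.csv' WITH (FORMAT csv);
    -- connection 2:
    COPY measurements FROM '/data/chunk_02.csv' WITH (FORMAT csv);
    -- ...one session per core, up to your 12 cores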

Huge result sets (like a SELECT statement that returns 1,000,000 rows) will be slow.
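
One common way around that (my suggestion, not a requirement) is a server-side cursor, which fetches the result in batches inside a transaction. Table and cursor names are made up:

    BEGIN;
    DECLARE big_scan CURSOR FOR
        SELECT * FROM measurements WHERE value > 42.0;
    FETCH FORWARD 10000 FROM big_scan;  -- repeat until it returns no rows
    CLOSE big_scan;
    COMMIT;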

PG is a much fuller database than mysql, and as such you can influence its join types and function calls (table scan vs. index scan, immutable vs. stable functions, perl functions vs. sql functions). So if it appears slow at first, you have a million options. In mysql, I think your only option is to pull the data back and code it yourself.
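
Two examples of those knobs, with a hypothetical table and function:

    -- Nudge the planner away from a sequential scan, for this session only
    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT * FROM measurements WHERE sensor_id = 7;
    RESET enable_seqscan;

    -- Declaring volatility (IMMUTABLE vs STABLE) tells the planner
    -- it may pre-evaluate or cache calls to the function
    CREATE FUNCTION to_celsius(f double precision) RETURNS double precision
      AS $$ SELECT ($1 - 32) * 5.0 / 9.0 $$
      LANGUAGE sql IMMUTABLE;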

Upgrading between major versions of PG may or may not be painful. (mysql sometimes works seamlessly between versions, and it appears brilliant, but I have had problems with an update, and when it goes bad you don't have a lot of options.) In the past, PG's only upgrade method was a full backup of the old cluster restored into the new one. Things have gotten better: there is pg_upgrade support now (still kind of new, though), and there is some 3rd-party replication support where you replicate your 9.0 database to a new 9.1 database and at some point promote the 9.1 database as the new master. Or something like that; I've only read posts about it, never done it. But with that much data, you'll need an upgrade plan.

All in all, if I can summarize my personal view: mysql is fast at the expense of safety and usability (mysql still cannot do UPDATEs with subselects). PG is safe and usable at the expense of speed, and you won't be disappointed by the speed.
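
For what it's worth, the kind of statement I mean looks like this in PG (made-up table names):

    -- UPDATE driven by a subselect
    UPDATE accounts
       SET flagged = true
     WHERE id IN (SELECT account_id
                    FROM logins
                   WHERE failed_attempts > 5);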

-Andy
