From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: Igor Chudov <ichudov(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres for a "data warehouse", 5-10 TB
Date: 2011-09-11 17:02:06
Message-ID: CABRT9RB-etP-K-0u0y-Wv+BysSd9ToZm71BqcY7hFQZC2tmjYQ@mail.gmail.com
Lists: pgsql-performance
On Sun, Sep 11, 2011 at 17:23, Andy Colson <andy(at)squeakycode(dot)net> wrote:
> On 09/11/2011 08:59 AM, Igor Chudov wrote:
>> By the way, does that INSERT UPDATE functionality or something like this exist in Postgres?
> You have two options:
> 1) write a function like:
> create function doinsert(_id integer, _value text) returns void as
> 2) use two sql statements:
Unfortunately both of these options have caveats. Also, depending on
your I/O speed, you might need to use multiple loader threads to
saturate the write bandwidth.
However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and will thus cause unique violations.
If you could somehow partition their work by some key, so that threads
are guaranteed not to conflict with each other, then that would be
perfect.
The 2nd option given by Andy is probably faster.
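For reference, the second variant could look something like the sketch
below (the table "mytable" and its columns are made up for
illustration):

    -- assuming: create table mytable (id integer primary key, value text)
    UPDATE mytable SET value = 'foo' WHERE id = 1;
    INSERT INTO mytable (id, value)
        SELECT 1, 'foo'
        WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE id = 1);

Note that this is still subject to the race described above: two
concurrent sessions can both pass the NOT EXISTS check before either
commits, and one of them will then hit a unique violation.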
You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.
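For the record, such a function is a variation on the looping
UPDATE/INSERT example in the PL/pgSQL docs; a sketch along those lines
(again with a made-up "mytable"):

    CREATE FUNCTION doinsert(_id integer, _value text) RETURNS void AS $$
    BEGIN
        LOOP
            -- try to update an existing row first
            UPDATE mytable SET value = _value WHERE id = _id;
            IF found THEN
                RETURN;
            END IF;
            -- no row yet; try to insert. If another session inserts
            -- the same key first, we get unique_violation and retry.
            BEGIN
                INSERT INTO mytable (id, value) VALUES (_id, _value);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                NULL;  -- loop back and try the UPDATE again
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

Entering the inner BEGIN ... EXCEPTION block is what sets up the
subtransaction, so every call that reaches the INSERT burns an extra
transaction ID whether or not the exception fires.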
----
Which brings me to another important point: don't do lots of small
write transactions, SAVEPOINTs or PL/pgSQL subtransactions. Besides
being inefficient, they introduce a big maintenance burden. In
PostgreSQL's MVCC, each tuple stores the 32-bit ID of the transaction
that inserted it (xmin). Once the transaction ID counter reaches its
maximum 32-bit value, it "wraps around". To prevent old rows from
appearing as new, a "vacuum freeze" process runs after every
autovacuum_freeze_max_age transactions (200 million by default) and
rewrites all sufficiently old rows in your database. Using fewer
transaction IDs means it runs less often.
On small databases, this is usually not important. But on a 10TB data
warehouse, rewriting a large part of your database totally kills
performance for any other processes.
This is detailed in the documentation:
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
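If you want to keep an eye on this, a quick check (just pg_database
and the age() function, nothing exotic) shows how far each database is
from the freeze threshold:

    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

When xid_age approaches autovacuum_freeze_max_age, the anti-wraparound
vacuum will kick in on that database.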
Regards,
Marti