Re: PostgreSQL, OLAP, and Large Clusters

From: Ondrej Ivanič <ondrej(dot)ivanic(at)gmail(dot)com>
To: Ryan Kelly <rpkelly22(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PostgreSQL, OLAP, and Large Clusters
Date: 2012-09-27 22:48:16
Message-ID: CAM6mieKi28B_VQ0UEuobG16njEm6iec+EnwhBq4fXUqvsEwh6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

On 28 September 2012 04:34, Ryan Kelly <rpkelly22(at)gmail(dot)com> wrote:
> On Thu, Sep 27, 2012 at 08:51:31AM +1000, Ondrej Ivanič wrote:
>> - aggregation job ran every 15 minutes and completed under 2 minutes:
>> 5mil rows -> aggregation -> 56 tables
> 5mil overall, or matching your aggregation query? And is that the 2TB
> mentioned above? We have more than 100 times that many rows, but less
> data.

Let me explain. ETL process imports several thousands row every 5
minutes or so. Aggregation job runs every 15 minutes and it grabs the
everything new since last run which could be up to 5 mil rows. Next
step is to compute aggregates -- 56 queries like insert into
mat_table1 select attr1, attr2, count(*) from tmp_table;

2TB was the size of the live dataset - 6 months, 30-40mil rows per month.

>> - all queries can be executed over date range up to several months
>> (monthly partitioned tables, 6 months history)
> Yeah we have to be able to query over various date ranges.

Partitioning works nicely in this case.

>> - AsterData: nice SQL-MR feature and analytics (decision trees,
>> frequent items, clustering, ...); No libpq support and you have to use
>> JDBC or selected ODBC manager
> I don't think no libpq support is a deal-breaker, but other missing
> features could be problematic.

It was for us -- we ended up with one option which was ODBC and
unixODBC segfaulted on simple queries like 'select 1::numeric'. Aster
removed many PG features (i think arrays and composite types are not
supported) but they added several cool things.

>> - Greenplum (winer): performance comparable to FusionIO (10 to 50
>> times); we were able to remove aggregation job (because of columnar
>> store model); easy to port from postgres but could be complicated if
>> you are heavy pgpsql user
> Not using any pl/pgpsql, but a number of other features: arrays and
> hstore, which I doubt (hopefully wrongly) that Greenplum supports.

Anything which you can compile against 8.2 might work...

>> At this time I would try:
>> - Postgres-XC
> From what I understand, more of a write-scaleable-oriented solution. We
> mostly will need read scalability. I also don't think it really handles
> redundancy.

read scalability is there as well: it can use multiple nodes for
select quires and push-down (execute it on node) certain operations.
Check this talk:
http://www.pgcon.org/2012/schedule/events/424.en.html

redundancy is up to you -- you can deploy as many coordinator nodes as
you need. Data distribution is quite flexible, see DISTRIBUTE BY and
TO GROUP / NODE clauses
(http://postgres-xc.sourceforge.net/docs/1_0/sql-createtable.html)

>> - Stado
> Looks promising, sounded very promising, but it doesn't seem to be
> particularly active or well-documented. It also doesn't support window
> functions (which I could probably get by without) or CTEs (which will be
> trickier, but doable. I'm also not sure of how easy it is to handle node
> failure or adding more nodes, as it appears the number of nodes is
> essentially fixed.

yup, documentations is not the best. You might have a look at
pgpool-II parallel query mode (docs is skimpy, not sure about window
functions and CTEs support)
http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html

>> PS. For any commercial solution is better to buy their appliance (if
>> you can afford it...).
> Thanks for the advice. Is it just better supported, or more performant,
> or...?

Usually both. You get support, monitoring, performance. Some of
appliances do dial-home calls hence you get support call back with
concrete advice / solution. Hardware is fine-tuned and proven: good
RAIDs controller, disks, 10GbE interconnects, redundant network /
storage paths. You can build something like that by your self but you
are not going to save in the long run.

--
Ondrej Ivanic
(ondrej(dot)ivanic(at)gmail(dot)com)

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Scott Marlowe 2012-09-28 00:03:03 Re: PostgreSQL, OLAP, and Large Clusters
Previous Message Leif Biberg Kristensen 2012-09-27 21:29:04 Re: problem with recreating database with export