From: | Ondrej Ivanič <ondrej(dot)ivanic(at)gmail(dot)com> |
---|---|
To: | Ryan Kelly <rpkelly22(at)gmail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: PostgreSQL, OLAP, and Large Clusters |
Date: | 2012-09-27 22:48:16 |
Message-ID: | CAM6mieKi28B_VQ0UEuobG16njEm6iec+EnwhBq4fXUqvsEwh6Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi,
On 28 September 2012 04:34, Ryan Kelly <rpkelly22(at)gmail(dot)com> wrote:
> On Thu, Sep 27, 2012 at 08:51:31AM +1000, Ondrej Ivanič wrote:
>> - aggregation job ran every 15 minutes and completed under 2 minutes:
>> 5mil rows -> aggregation -> 56 tables
> 5mil overall, or matching your aggregation query? And is that the 2TB
> mentioned above? We have more than 100 times that many rows, but less
> data.
Let me explain. ETL process imports several thousands row every 5
minutes or so. Aggregation job runs every 15 minutes and it grabs the
everything new since last run which could be up to 5 mil rows. Next
step is to compute aggregates -- 56 queries like insert into
mat_table1 select attr1, attr2, count(*) from tmp_table;
2TB was the size of the live dataset - 6 months, 30-40mil rows per month.
>> - all queries can be executed over date range up to several months
>> (monthly partitioned tables, 6 months history)
> Yeah we have to be able to query over various date ranges.
Partitioning works nicely in this case.
>> - AsterData: nice SQL-MR feature and analytics (decision trees,
>> frequent items, clustering, ...); No libpq support and you have to use
>> JDBC or selected ODBC manager
> I don't think no libpq support is a deal-breaker, but other missing
> features could be problematic.
It was for us -- we ended up with one option which was ODBC and
unixODBC segfaulted on simple queries like 'select 1::numeric'. Aster
removed many PG features (i think arrays and composite types are not
supported) but they added several cool things.
>> - Greenplum (winer): performance comparable to FusionIO (10 to 50
>> times); we were able to remove aggregation job (because of columnar
>> store model); easy to port from postgres but could be complicated if
>> you are heavy pgpsql user
> Not using any pl/pgpsql, but a number of other features: arrays and
> hstore, which I doubt (hopefully wrongly) that Greenplum supports.
Anything which you can compile against 8.2 might work...
>> At this time I would try:
>> - Postgres-XC
> From what I understand, more of a write-scaleable-oriented solution. We
> mostly will need read scalability. I also don't think it really handles
> redundancy.
read scalability is there as well: it can use multiple nodes for
select quires and push-down (execute it on node) certain operations.
Check this talk:
http://www.pgcon.org/2012/schedule/events/424.en.html
redundancy is up to you -- you can deploy as many coordinator nodes as
you need. Data distribution is quite flexible, see DISTRIBUTE BY and
TO GROUP / NODE clauses
(http://postgres-xc.sourceforge.net/docs/1_0/sql-createtable.html)
>> - Stado
> Looks promising, sounded very promising, but it doesn't seem to be
> particularly active or well-documented. It also doesn't support window
> functions (which I could probably get by without) or CTEs (which will be
> trickier, but doable. I'm also not sure of how easy it is to handle node
> failure or adding more nodes, as it appears the number of nodes is
> essentially fixed.
yup, documentations is not the best. You might have a look at
pgpool-II parallel query mode (docs is skimpy, not sure about window
functions and CTEs support)
http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html
>> PS. For any commercial solution is better to buy their appliance (if
>> you can afford it...).
> Thanks for the advice. Is it just better supported, or more performant,
> or...?
Usually both. You get support, monitoring, performance. Some of
appliances do dial-home calls hence you get support call back with
concrete advice / solution. Hardware is fine-tuned and proven: good
RAIDs controller, disks, 10GbE interconnects, redundant network /
storage paths. You can build something like that by your self but you
are not going to save in the long run.
--
Ondrej Ivanic
(ondrej(dot)ivanic(at)gmail(dot)com)
From | Date | Subject | |
---|---|---|---|
Next Message | Scott Marlowe | 2012-09-28 00:03:03 | Re: PostgreSQL, OLAP, and Large Clusters |
Previous Message | Leif Biberg Kristensen | 2012-09-27 21:29:04 | Re: problem with recreating database with export |