Are there any options to parallelize queries?

From: Seref Arikan <serefarikan(at)kurumsalteknoloji(dot)com>
To: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Are there any options to parallelize queries?
Date: 2012-08-21 08:45:39
Message-ID: CA+4ThdoofztNGdw8d1g7CBc4XE81fSydekvDV_0-5YTk7A_sog@mail.gmail.com
Lists: pgsql-general

Dear all,
I am designing an electronic health record repository which uses PostgreSQL
as its RDBMS. For those who may find the topic interesting, the
EHR standard I specialize in is openEHR: http://www.openehr.org/

My design makes use of parallel execution in the layers above the DB, and
it seems to scale quite well. However, I have a scale problem at hand. A
single patient can have up to 1 million different clinical data entries on
his/her own, after a few decades of usage. Clinicians do love their data,
and especially in chronic disease management, they demand access to
whatever data exists. If you have 20 years of data for a diabetic patient,
for example, they'll want to look for trends in that, or even scroll
through all of it, maybe with some filtering.
My requirement is to be able to process those 1 million records as fast as
possible. In case of population queries, we're talking about billions of
records. Each clinical record (even with all the optimizations our domain
has developed in the last 30 or so years) leads to a number of rows, so
you can see that this is really big data (imagine a national diabetes
registry with lifetime data of a few million patients).
I am ready to consider Hadoop or other non-transactional approaches for
population queries, but clinical care still requires that I process
millions of records for a single patient.

Parallel software frameworks such as Erlang's OTP or Scala's Akka do help a
lot, but it would be a lot better if I could feed those frameworks with
data faster. So, what options do I have to execute queries in parallel,
assuming a transactional system running on PostgreSQL? For example, I'd
like to get the last 10 years' records in chunks of 2 years of data, or
chunks of 5K records, fed to N parallel processing machines. The clinical
system should keep functioning in the meantime, with new records added etc.
PGPool looks like a good option, but I'd appreciate your input. Any proven
best practices, architectures, products?

Best regards
Seref
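
The chunking described in the message (the last 10 years of records split
into 2-year ranges and handed to N parallel workers) can be sketched roughly
as follows. This is only an illustration under stated assumptions, not the
poster's design: the function names year_chunks and fetch_parallel are
hypothetical, and the per-chunk fetch is passed in as a callable so that the
sketch does not assume any particular database driver or schema.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def year_chunks(first_year, last_year, years_per_chunk):
    """Split [first_year, last_year) into half-open date ranges.

    Each range is (start, end) with start inclusive and end exclusive,
    so adjacent chunks never overlap and no record is read twice.
    """
    chunks = []
    year = first_year
    while year < last_year:
        upper = min(year + years_per_chunk, last_year)
        chunks.append((date(year, 1, 1), date(upper, 1, 1)))
        year = upper
    return chunks


def fetch_parallel(chunks, fetch_chunk, workers):
    """Run fetch_chunk on every chunk using a pool of worker threads.

    fetch_chunk is supplied by the caller. In a real system it would
    presumably open its own database connection and run a range query
    such as (table and column names are hypothetical):
        SELECT ... FROM clinical_entry
        WHERE patient_id = ? AND recorded_at >= ? AND recorded_at < ?
    Results come back in the same order as the chunks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_chunk, chunks))


if __name__ == "__main__":
    # Stand-in fetch function that only reports the range it was given.
    def describe(chunk):
        start, end = chunk
        return "%s..%s" % (start.isoformat(), end.isoformat())

    for line in fetch_parallel(year_chunks(2002, 2012, 2), describe, 4):
        print(line)
```

Half-open ranges are used so that a record falling exactly on a boundary
belongs to one chunk only. Note that separate connections each see their own
snapshot of the data, so records added while the chunks are being read may
appear in some chunks and not others unless that is handled explicitly.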
