Support for distributed queries with semijoins could be possible, if ...

From: Gunther Schadow <gunther(at)aurora(dot)regenstrief(dot)org>
To: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Support for distributed queries with semijoins could be possible, if ...
Date: 2002-06-09 17:43:44
Message-ID: 3D0393D0.2070208@aurora.regenstrief.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hi again pgsql gurus,

if I want to do a distributed join query, I would like to have a
way for streaming as much of it as possible. I think there is a
modest addition to pgsql that could do that, what do you think?

The principle is similar to the COPY FROM STDIN form, just that
we would not INSERT that data but do an inner join on data piped
to the engine from STDIN.

Here is an example.

CREATE TABLE Observation (
id OID NOT NULL PRIMARY KEY,
patient_id OID NOT NULL REFERENCES Patient(id);
type_code VARCHAR NOT NULL;
value DOUBLE;
);

and the assumption is that this table is horizontally partitioned
over different systems, each covering different types of
observations (partitioned on type_code clusters.) But all
would use the same set of patient ids.

Consider we want to query for patients with fever over 100 and
white blood cell count under 1000. Without horizontal partitioning,
we could write the following query (I am choosing this particular
form to prepare the reader for the distributed execution.)

SELECT fever.patient_id AS patient_id,
fever.value AS fever,
wbc.value AS wbc,
FROM (SELECT patient_id, value
FROM Observation fever
WHERE fever.type_code = 'FEVER'
AND fever.value > 100) fever
INNER JOIN Observation wbc
ON fever.patient_id = wbc.patient_id
WHERE wbc.type_code = 'WHITE BLOOD CELL COUNT'
AND wbc.value < 1000;

Now, consider that fever observations are on system A and white
blood cell count observations are on system B, while A and B
share the same set of patient ids. We could then distribute as
follows:

to system A:

SELECT fever.patient_id AS patient_id,
fever.value AS fever
FROM Observation fever
WHERE fever.type_code = 'FEVER'
AND fever.value > 100;

then we use a cursor on that query such that we can get results
as early as posible for streaming those results into the next
query to system B:

SELECT fever.patient_id AS patient_id,
fever.value AS fever,
wbc.value AS wbc
FROM (SELECT patient_id, value
FROM STDIN) fever
INNER JOIN Observation wbc
ON fever.patient_id = wbc.patient_id
WHERE wbc.type_code = 'WHITE BLOOD CELL COUNT'
AND wbc.value < 1000;

Notice the "SELECT ... FROM STDIN" form, which is reminescent of
the "COPY ... FORM STDIN". Presumably on the interface one could
use the same syntax for tuples as for the COPY FORM (e.g., DELIMITES
AS '|' WITH NULL AS '') and then ship the tuples line by line ending
with a \.

How hard would it be to add such a feature?

On the JDBC interface, I presume that a useful form would be to
somehow pipe the ResultSet from the first query into the second
query, but I don't know if JDBC provides such a mechanism.

I think this could be very useful for the time until PostgreSQL
can internally deal with horizontal partitioning over distributed
data base nodes ;-).

thanks,
-Gunther

--
Gunther Schadow, M.D., Ph.D. gschadow(at)regenstrief(dot)org
Medical Information Scientist Regenstrief Institute for Health Care
Adjunct Assistant Professor Indiana University School of Medicine
tel:1(317)630-7960 http://aurora.regenstrief.org

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Gunther Schadow 2002-06-09 18:10:52 Re: select failure
Previous Message Gunther Schadow 2002-06-09 17:07:35 Q: will GROUP BY make use of an index to return tuples early?