Decision support query inefficiencies ...

From: Gunther Schadow <gunther(at)aurora(dot)regenstrief(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Decision support query inefficiencies ...
Date: 2002-07-29 20:00:16
Message-ID: 3D459ED0.3030000@aurora.regenstrief.org
Lists: pgsql-sql

Hi,

this is a more general question rather than one about a specific
PostgreSQL implementation detail, but maybe PostgreSQL has a specific
answer, and hopefully you can give me some general hints.

We have a large database of, let's call them "Acts" (in our case these
are clinical observations, but they could just as well be purchase
orders, account transactions, etc.):

CREATE TABLE Act (
    id          OID NOT NULL PRIMARY KEY,
    subject_id  OID NOT NULL REFERENCES Subject(id),
    type_code   VARCHAR NOT NULL REFERENCES Act_type(code),
    time        TIMESTAMP NOT NULL
    /* detail ... */
);

CREATE INDEX act_subject_type_time_idx
    ON Act(subject_id, type_code, time);

Now suppose we have 200 million of these acts for 2 million subjects
in our test database. The production version will hold about 100 times
that amount.

A very frequent decision support query we have to run is to get the
last act of a certain type that occurred before a certain cut-off
date. Worse yet, the cutoff date is quite likely a correlated
subquery itself, but that's for later. For now, I'm always struck by
the difficulty of asking the simple question of the most recent act
of a certain type before a cutoff time:

SELECT Act.*
  FROM (SELECT subject_id, type_code, MAX(time) AS time
          FROM Act
         WHERE Act.type_code = $ACT_TYPE
           AND Act.time < $CUTOFF_TIME
         GROUP BY subject_id, type_code) last
       INNER JOIN Act
          ON (Act.subject_id, Act.type_code, Act.time)
           = (last.subject_id, last.type_code, last.time);

I'm not so worried about the complexity of writing this query as
about how inefficient it appears to be. As far as I can tell, the
engine has to:

1. select the Acts of the specific type code,
2. sort them by subject_id, type_code and time, and
3. join that result back against the Act table, fetching the same
   rows again via random access.
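
For comparison, the other obvious way I can think of to write it is
with a correlated subquery, which I suspect is no better, since the
MAX() scan has to be repeated for every outer row (same placeholders
as above; this is just a sketch, I have not timed it):

SELECT a.*
  FROM Act a
 WHERE a.type_code = $ACT_TYPE
   AND a.time < $CUTOFF_TIME
   AND a.time = (SELECT MAX(b.time)
                   FROM Act b
                  WHERE b.subject_id = a.subject_id
                    AND b.type_code  = a.type_code
                    AND b.time < $CUTOFF_TIME);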

Wouldn't it be nice to somehow retain the sort order produced for the
GROUP BY operation, and just read the last item of each group with
the same group-by key?

Can PostgreSQL do something like that? Is my desire for some
optimization flawed? Am I overlooking an existing, well-known
optimization one can do with plain SQL?
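
The closest thing I have found so far is PostgreSQL's non-standard
DISTINCT ON clause, which at least expresses "one row per group"
directly. Here is a sketch of how I imagine it would look, though I
have not checked whether the planner does anything smarter with it
than with the self-join above:

SELECT DISTINCT ON (subject_id) *
  FROM Act
 WHERE type_code = $ACT_TYPE
   AND time < $CUTOFF_TIME
 ORDER BY subject_id, time DESC;

Is that, or something like it, the recommended way to handle this
kind of "last value before a cutoff" query?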

I would greatly appreciate your advice or ideas.
-Gunther

PS: A materialized view or summary table would not help in this
case, because we're asking for the last value before a cutoff time
that varies from query to query.

--
Gunther Schadow, M.D., Ph.D. gschadow(at)regenstrief(dot)org
Medical Information Scientist Regenstrief Institute for Health Care
Adjunct Assistant Professor Indiana University School of Medicine
tel:1(317)630-7960 http://aurora.regenstrief.org
