Generalized Top Queries on PostgreSQL

From: Roberto Cornacchia <cornacch(at)CS(dot)UniBO(dot)IT>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Paolo Ciaccia <ciaccia(at)CS(dot)UniBO(dot)IT>, Andrea Ghidini <ghidini(at)CS(dot)UniBO(dot)IT>
Subject: Generalized Top Queries on PostgreSQL
Date: 2000-02-19 01:23:38
Message-ID: Pine.GSO.3.96.1000219022048.17025B-100000@adina.cs.unibo.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello everyone,

Dr. P. Ciaccia, A. Ghidini and I (R. Cornacchia), have recently
developed, and implemented on PostgreSQL, a class of TOP Queries.

Here I would like to briefly present our main results.

The proposed sintax is the following:

SELECT
FROM
WHERE
STOP AFTER <N>
FOR EACH <stop-grouping attribute list>
RANK BY <ranking specification>
ORDER BY

where both FOR EACH and RANK BY (and STOP AFTER clause as a whole) are
optional.

Here is an example:

----
"Retrieve the 10 highest paid employees for each department.
Order the results first on department name and then on employee name"

SELECT Dept.name, Emp.name, Emp.salary
FROM Emp, Dept
WHERE Emp.dno = Dept.dno
STOP AFTER 10
FOR EACH Dept.dno
RANK BY Emp.salary DESC
ORDER BY Dept.name, Emp.name
----

We called such a query "Generalized Top Query".

The semantics introduced by the example is derived from the one
proposed by Carey and Kossmann ("On Saying 'Enougth Already' in
SQL", 1997). The main points of our extension are:

1) You can obtain the <N> Top rows as dictated by the RANK BY
specification and then produce the results according to the
ORDER BY specification. This means much more flexibility.

2) You can obtain the Top N rows for each of the "groups" which are
individuated by the FOR EACH specification
(from here "Generalized" top query).

Please consider that our semantics completely includes the current
LIMIT clause capabilities, offering at the same time some additional
ones.

For what concerns PostgreSQL, one of our first issues were to keep
basically unchanged
the current framework.

Here is a brief list of major changes we operated:

- We provided 2 new physical operators:
ScanStop: stops the stream to <N> rows
SortStop: performs an ad-hoc sort on the stream,
retaining only the <N> top rows.

- The optimizer can operate a push-down in the path-tree of
those two operators. It means we reduce the
stream cardinality as soon as possible, leading to a great
improvement on the performances of the subsequent operators.

- A larger number of optimizable operators force the optimizer to
generate more plans to handle,
so we extended the operators properties and the pruning rules.

- We extended the cost model as well, introducing estimates for the
cost of producing the FIRST N tuples.

- The rules involved in the Stop operators placement are mainly
based upon referential integrity constraints. In this way we
provided a *temporary* solution to this Postgres lack, storing
the informations on constraints in two new system catalogs.

- The FOR EACH generalization leads to natural generalization
of all the above concepts.

- The evaluation of GROUP BY clause can be performed before and after
the STOP AFTER, and both makes sense. In order to optimize the Stop
After operator, we choose to evaluate it before the group by
(and the order by).

* Let us summarize the current state of our work. *

- Our extension is highly optimized and can lead to performance
improvements of several orders of magnitude in comparison with the
current "LIMIT approach"
- It has a low impact on the original PostgreSQL code
- It does not affect the usual processing of classical queries
- It is soon expected to work with views.
- It works with subqueries
- It works with cursors
- It efficently makes use of indices
- It is updated to a 6.6 snapshot of November 1999 (we are waiting
for a more stable release)

--------------------------------------------------------------------

More details on this subject are going to be presented in a
forthcoming paper, available if interested.

We would be glad to receive your comments on this work.

Best regards,

R. Cornacchia (cornacch(at)cs(dot)unibo(dot)it) Computer Science, University of
Bologna

A. Ghidini (ghidini(at)cs(dot)unibo(dot)it) Computer Science, University of Bologna

Dr. Paolo Ciaccia (ciaccia(at)cs(dot)unibo(dot)it) DEIS CSITE-CNR, University of
Bologna

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc Tardif 2000-02-19 01:26:11 handling multiple file descriptors
Previous Message Bruce Momjian 2000-02-19 00:38:35 Re: [HACKERS] Re: SQL compliance - why -- comments only at psql level ?