Re: [PERFORM] not using index for select min(...)

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-02 20:24:26
Message-ID: 200302021224.26108.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom,

> In the end, the only reasonable way to handle this kind of thing is
> to teach the query planner about it. Considering the small number
> of cases that are usefully optimizable (basically only MIN and MAX
> on a single table without any WHERE or GROUP clauses), and the ready
> availability of a SQL-level workaround, it strikes me as a very
> low-priority TODO item.

Low priority for you, Tom. For some of us, it's one of the three most
high-priority "bugs" in PostgreSQL.

I constantly try to sell my clients, and potential clients, on PostgreSQL.
And the two things that trip me up the most frequently are lack of
replication and our dog-slow aggregates. I can usually sell Postgres on our
strong points, but the aggregate issue is *always* a problem. And the "slow
aggregate" problem comes up about twice a week on Performance and three times
a week on SQL.

Regardless of the technical reason, among MSSQL, Oracle, MySQL and PostgreSQL,
we have the slowest performing simple aggregates. It's very well to explain
this is due to our system of extensible aggregates, but if a potential
Postgres developer doesn't want to create custom aggregates, but does want to
use MIN() in a correlated subquery, then they will go to a different RDBMS.

As I said before, I'm absolutely thrilled that you came up with a solution for
COUNT(*) ... GROUP BY queries through Hash Aggregates. That's half the
picture, now we need a way to speed up MIN() and MAX() for simple one-column
expressions. While there is a "workaround" using ORDER BY & LIMIT, this
doesn't work for correlated subqueries or if one wants to evaluate the result
of MAX() in the query. For example, the following query is not possible to
"workaround" in PostgreSQL:

select teams_desc.team_id, team_name, team_code, notes,
min(teams_tree.treeno) as lnode, max(teams_tree.treeno) as rnode,
parent.team_id as parent_id, count(*)/2 as tlevel
from teams_desc JOIN teams_tree USING (team_id)
join teams_tree parent ON parent.treeno < teams_tree.treeno
join teams_tree parents on parents.treeno < teams_tree.treeno
WHERE parent.treeno = (SELECT max(p1.treeno) from teams_tree p1
where p1.treeno < teams_tree.treeno
and exists (select treeno from teams_tree p2
where p2.treeno > teams_tree.treeno
and p2.team_id = p1.team_id))
AND EXISTS (select parents2.team_id from teams_tree parents2
where parents2.treeno > teams_tree.treeno
AND parents2.team_id = parents.team_id)
group by teams_desc.team_id, team_name, team_code, notes, parent.team_id;

While one would hardly expect the above query to be fast, it is dissapointing
that it takes about 8-10 times as long to execute on PostgreSQL as on MSSQL,
since MSSQL seems to be able to use indexes to evaluate all three MIN() and
MAX() expressions.

Further, assigning such a common query function to a Postgres-specific
workaround hardly upholds our project's dedication to standards. The fact
that we are telling new users to use non-SQL-compliant code to do a query
type present in 90% of databases bothers me every single time I give a newbie
that advice.

It still seems to me that if a query's WHERE expression can be evaluated using
an index, then any related MIN() or MAX() expression should be evaluable
using an index. That is, if you are selecting:
SELECT MAX(team_id) FROM teams WHERE team_id BETWEEN 100 and 200;
... with an index on team_id then this entire query should be able to return
trough an index scan. We've discussed the particular planner problems this
presents for PostgreSQL, but I still believe that these are solvable ... and
moreover, that we *need* to solve them if we're going to be competitive with
other SQL RDBMSes.

I do realize that it's my job to find something to do about this issue since
I'm the one so worked up about it. What I'm concerned about is the
possibility of having any idea or fix I come up with dismissed out of hand
because it's a "low-priority todo". Please add up the questions and
complaints of the users on SQL, NOVICE, and PERFORMANCE ... I know you read
them.

Thanks for reading, Tom.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Neil Conway 2003-02-02 20:28:15 Re: Last call for 7.3.2
Previous Message Dave Page 2003-02-02 20:22:49 Interactive Documentation - how do you want it to work?