From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: James Cranch <jdc41(at)cam(dot)ac(dot)uk>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Rapidly finding maximal rows
Date: 2011-10-19 14:54:56
Message-ID: CAHyXU0w9qAJ=x99UEErdaHPm=g_QLrESZHGQrq5Tx2T9QqRLbA@mail.gmail.com
Lists: pgsql-performance
On Tue, Oct 11, 2011 at 8:05 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
> Hi James
>
>
> I'm guessing the problem is that the combination of using a view and the way
> the view is defined with an in-line temporary table is too complex for the
> planner to introspect into, transform and figure out the equivalent direct
> query, and so it's creating that entire temporary table every time you
> evaluate the select.
>
> Our app has some similar queries (get me the most recent row from a data
> logging table) and these work fine with a simple self-join, like this
> example (irrelevant columns omitted for discussion)
>
> select t.entity, t.time_stamp, t.data from log_table t
> where t.entity=cast('21EC2020-3AEA-1069-A2DD-08002B30309D' as uuid)
> and t.time_stamp=
> (select max(time_stamp)
> from log_table u
> where t.entity=u.entity)
>
> given a schema with the obvious indexes ...
>
> create table log_table
> (entity UUID,
> time_stamp TIMESTAMP WITHOUT TIME ZONE,
> data TEXT);
>
> create index log_table_index on log_table (entity, time_stamp);
>
> .. and the plan for the dependent sub-query does the obvious reverse index
> scan as you'd expect / want.
>
>
>
> If you still want / need to have the view, I suspect that getting rid of the
> temp table definition will fix it ... my effort is below; alternatively, you
> might be able to take your first example and pull out best_scores and define
> it as a view also.
>
> CREATE VIEW best_in_school_method3 AS
> SELECT competition_name, academic_year_beginning, centre_number, entry_id,
> total_score, (true) AS best_in_school FROM challenge_entries ce1
> WHERE total_score =
> (SELECT MAX(total_score) FROM challenge_entries ce2
> WHERE ce1.competition_name=ce2.competition_name
> AND ce1.academic_year_beginning=ce2.academic_year_beginning
> AND ce1.centre_number=ce2.centre_number
> )
This is a very common problem in SQL and has a lot of interesting solutions.
In queries like this I usually use the 'ORDER BY total_score DESC
LIMIT 1' trick. Modern postgres is *usually* smart enough to convert
max() to that, but not always:
WHERE total_score =
(SELECT total_score FROM challenge_entries ce2
WHERE ce1.competition_name=ce2.competition_name
AND ce1.academic_year_beginning=ce2.academic_year_beginning
AND ce1.centre_number=ce2.centre_number
ORDER BY total_score DESC LIMIT 1
)
Another clever and more portable way to write it, which can sometimes
give a better plan, is like this:
WHERE NOT EXISTS
(
SELECT 1 FROM challenge_entries ce2
WHERE ce1.competition_name=ce2.competition_name
AND ce1.academic_year_beginning=ce2.academic_year_beginning
AND ce1.centre_number=ce2.centre_number
AND ce2.total_score > ce1.total_score
)
Yet another interesting way of getting the 'top' record based on a
certain criteria is to write a custom aggregate. In postgres you can
aggregate over the entire record type, not just plain fields, so that
running your aggregate function looks like this:
SELECT competition_name, academic_year_beginning, centre_number,
max_challenge_entries(ce) FROM challenge_entries ce GROUP BY 1,2,3;
The aggregate's transition function in SQL would look something like this:
CREATE OR REPLACE FUNCTION
max_challenge_entries_impl(challenge_entries, challenge_entries)
RETURNS challenge_entries AS
$$
  -- keep whichever of the two rows has the higher total_score
  SELECT CASE WHEN ($2).total_score > ($1).total_score THEN $2 ELSE $1 END;
$$ LANGUAGE SQL IMMUTABLE STRICT;  -- STRICT so the first row seeds the state
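The transition function alone isn't callable from a query; you also need
to declare the aggregate over it. A minimal sketch, using the
max_challenge_entries name from the query above:

CREATE AGGREGATE max_challenge_entries(challenge_entries)
(
    SFUNC = max_challenge_entries_impl,
    STYPE = challenge_entries
);

With no INITCOND and a STRICT transition function, the first row of each
group becomes the starting state and every later row is compared against it.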
This very STL-ish approach is rarely the best way to go performance-wise,
although it can give better worst-case plans in some cases (note that an
index on total_score can never be used to optimize it). I mention it
because I find it very clean conceptually, and it can be a great approach
if your 'picking' logic is sufficiently more complex than 'field > field'
and is otherwise not optimizable.
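Since the aggregate returns a whole challenge_entries row, you can pull
individual fields back out of the result with composite field selection.
A sketch of what that usage might look like:

SELECT competition_name, academic_year_beginning, centre_number,
    (max_challenge_entries(ce)).entry_id AS best_entry_id,
    (max_challenge_entries(ce)).total_score AS best_total_score
FROM challenge_entries ce GROUP BY 1,2,3;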
Anyways, food for thought.
merlin