Re: the best way to get the first record from each group

From: PFC <lists(at)boutiquenumerique(dot)com>
To: q2005 <q2005(at)tpg(dot)com(dot)au>, pgsql-sql(at)postgresql(dot)org
Subject: Re: the best way to get the first record from each group
Date: 2005-02-08 02:20:21
Message-ID: opsluwr7fpth1vuj@musicbox
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql


I don't really gr0k your field names so I'll use an easier example :

CREATE TABLE groups ( group_id SERIAL PRIMARY KEY, group_name TEXT NULL )
WITHOUT OIDS;

CREATE TABLE people
( user_id SERIAL PRIMARY KEY,
group_id INTEGER NOT NULL REFERENCES groups(group_id),
score INTEGER NOT NULL )
WITHOUT OIDS;

CREATE INDEX people_scored ON people( group_id, score );

... put 1K rows into groups, vacuum analyze
... put 128K rows into people, vacuum analyze

So you want the user in each group with the highest score (or the lowest
subno... thats the same).

0-- DISTINCT ON
----------------------------------------------------------------------------------------
SELECT DISTINCT ON (group_id) group_id, user_id, score FROM people ORDER
BY group_id, score;
Unique (cost=0.00..4968.17 rows=996 width=12) (actual
time=0.144..539.667 rows=1000 loops=1)
-> Index Scan using people_scored on people (cost=0.00..4640.49
rows=131072 width=12) (actual time=0.141..454.893 rows=131072 loops=1)
Total runtime: 540.212 ms
----------------------------------------------------------------------------------------

It works but is about the slowest thing imaginable : index-scanning (or
sorting) the entire table.
DISTINCT ON can be very convenient nonetheless !

seq scan => disqualified.

1-- min(), max()

max() will give you the score but it won't give you the user_id so you
have to resort to a trick just like you did.
And it does a seq scan => disqualified.

2-- custom aggregate

You could possibly write an aggregate which takes a row from people as an
argument, or an ARRAY[score,user_id] and which acts like max and returns
the ARRAY[score,user_id] with the highest score so you can have its
user_id. As array comparison works as expected in pgsql, you could use
max(), unfortunately for you, max() does not work on integer arrays (Why
is that so ?) so this solution needs you to write a custom aggregate.

Note that this would still need a seq scan, and chould be slower than the
DISTINCT => disqualified.

2-- subquery

The problem is that you'll need a list of your groups. If you don't have
a table, you'll have to extract them from the table people with a (SELECT
group_id FROM people GROUP BY group_id) which is a sequential scan. So
I'll presume there is a groups table, which is why I put a 'REFERENCES
groups(group_id)' in the table declaration above.

To get the best score in a group of id GID we write :

----------------------------------------------------------------------------------------
SELECT user_id FROM people WHERE group_id=5 ORDER BY group_id DESC, score
DESC LIMIT 1;

Limit (cost=0.00..3.69 rows=1 width=12) (actual time=0.054..0.055 rows=1
loops=1)
-> Index Scan Backward using people_scored on people
(cost=0.00..480.02 rows=130 width=12) (actual time=0.051..0.051 rows=1
loops=1)
Index Cond: (group_id = 5)
Total runtime: 0.143 ms
----------------------------------------------------------------------------------------

To get the best scores for all groups we apply this SELECT to all groups.
You see now why we need a groups table to precalculate the groups.

----------------------------------------------------------------------------------------
SELECT g.group_id, (SELECT user_id FROM people WHERE group_id=g.group_id
ORDER BY group_id DESC, score DESC LIMIT 1) as user_id FROM groups g;

Seq Scan on groups g (cost=0.00..3702.48 rows=1000 width=4) (actual
time=0.079..18.942 rows=1000 loops=1)
SubPlan
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.014..0.014 rows=1 loops=1000)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1
loops=1000)
Index Cond: (group_id = $0)
Total runtime: 19.475 ms
----------------------------------------------------------------------------------------

Note that the subselect here can only yield ONE column so another join
comes in to get the score :

-- Take 1
----------------------------------------------------------------------------------------
SELECT * FROM people WHERE user_id IN (SELECT (SELECT user_id FROM people
WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as
user_id FROM groups g);

Nested Loop (cost=21.19..10418.45 rows=1000 width=12) (actual
time=29.851..87.289 rows=1000 loops=1)
-> HashAggregate (cost=17.50..3704.98 rows=1000 width=4) (actual
time=29.789..32.174 rows=1000 loops=1)
-> Seq Scan on groups g (cost=0.00..15.00 rows=1000 width=4)
(actual time=0.119..27.982 rows=1000 loops=1)
SubPlan
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.023..0.023 rows=1 loops=1000)
-> Index Scan Backward using people_scored on
people (cost=0.00..486.75 rows=132 width=12) (actual time=0.020..0.020
rows=1 loops=1000)
Index Cond: (group_id = $0)
-> Index Scan using people_pkey on people (cost=3.69..6.70 rows=1
width=12) (actual time=0.020..0.022 rows=1 loops=1000)
Index Cond: (people.user_id = (subplan))
SubPlan
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.027..0.027 rows=1 loops=1000)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (actual time=0.024..0.024 rows=1
loops=1000)
Index Cond: (group_id = $0)
-> Limit (cost=0.00..3.69 rows=1 width=12) (never executed)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (never executed)
Index Cond: (group_id = $0)
Total runtime: 88.039 ms
----------------------------------------------------------------------------------------

-- Take 2
----------------------------------------------------------------------------------------
SELECT p.group_id, p.user_id, p.score FROM people p, (SELECT g.group_id,
(SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id
DESC, score DESC LIMIT 1) as user_id FROM groups g) as top WHERE p.user_id
= top.user_id;

Nested Loop (cost=3.69..10415.95 rows=1000 width=12) (actual
time=0.200..57.387 rows=1000 loops=1)
-> Seq Scan on groups g (cost=0.00..15.00 rows=1000 width=4) (actual
time=0.051..1.556 rows=1000 loops=1)
-> Index Scan using people_pkey on people p (cost=3.69..6.70 rows=1
width=12) (actual time=0.015..0.016 rows=1 loops=1000)
Index Cond: (p.user_id = (subplan))
SubPlan
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.032..0.032 rows=1 loops=1000)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (actual time=0.029..0.029 rows=1
loops=1000)
Index Cond: (group_id = $0)
-> Limit (cost=0.00..3.69 rows=1 width=12) (never executed)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (never executed)
Index Cond: (group_id = $0)
Total runtime: 58.004 ms
----------------------------------------------------------------------------------------

Why is the Limited plan displayed twice with the second 'never executed',
I have no idea.
Now let's kill that JOIN thanks to a neat postgres feature :
ARRAY[score,user_id]

-- Take 3
----------------------------------------------------------------------------------------
SELECT group_id, top.us[1] as user_id, top.us[2] as user_id FROM (SELECT
g.group_id, (SELECT ARRAY[user_id,score] FROM people WHERE
group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as us FROM
groups g) as top;
----------------------------------------------------------------------------------------
Seq Scan on groups g (cost=0.00..7389.95 rows=1000 width=4) (actual
time=0.109..39.410 rows=1000 loops=1)
SubPlan
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.014..0.014 rows=1 loops=1000)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1
loops=1000)
Index Cond: (group_id = $0)
-> Limit (cost=0.00..3.69 rows=1 width=12) (actual
time=0.016..0.017 rows=1 loops=1000)
-> Index Scan Backward using people_scored on people
(cost=0.00..486.75 rows=132 width=12) (actual time=0.014..0.014 rows=1
loops=1000)
Index Cond: (group_id = $0)
Total runtime: 39.956 ms
----------------------------------------------------------------------------------------

Nice !
However, here is something fishy : using "top.us[1] as user_id,
top.us[2]" actually executes the subselect TWICE !!!!! Can anybody point
the reason to me ? It doubles the query execution time, as the outer
SELECT is just a dumb wrapper.

Even with this, it is more than an order of magnitude faster than the
DISTINCT ON
I hope I haven't messed it up and that you get the same result ?

Note that the subselect here can only yield ONE row, so you can't select
the first N, only the first 1... unless you make it yield a 2-dimension
array (with array_accum). It's doable, but there is a nicer solution ...

----------------------------------------------------------------------------------------
Stored procs :

-- Take 1
----------------------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION topn( INTEGER ) RETURNS SETOF people LANGUAGE
PLPGSQL AS $$
DECLARE
_nrows ALIAS FOR $1;
_gid groups;
_p people;
BEGIN
FOR _gid IN SELECT * FROM groups LOOP
FOR _p IN SELECT * FROM people WHERE group_id=_gid.group_id ORDER BY
group_id DESC, score DESC LIMIT _nrows LOOP
RETURN NEXT _p;
END LOOP;
END LOOP;
RETURN;
END;
$$;

Here being a stored proc, we can specify the number of records to return
in each group. Luxury...

test=# EXPLAIN ANALYZE SELECT * FROM topn(1);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Function Scan on topn (cost=0.00..12.50 rows=1000 width=12) (actual
time=90.759..91.927 rows=1000 loops=1)
Total runtime: 92.371 ms
(2 lignes)

Note, 92 ms to do 1K SELECT's, not that bad !

The top3 by categories :

test=# SELECT * FROM topn(3) LIMIT 10;
user_id | group_id | score
---------+----------+-------
120224 | 1 | 980
53244 | 1 | 977
101540 | 1 | 971
73022 | 2 | 994
52085 | 2 | 992
43474 | 2 | 990
94110 | 3 | 982
96491 | 3 | 975
96387 | 3 | 964
89660 | 4 | 993
(10 lignes)

Restraining the groups to query in the function would improve speed a lot.

-- Take 2
----------------------------------------------------------------------------------------

Let's get rid of the groups table.
This one simulates an index skip scan on the people table to extract the
groups as it goes.

CREATE OR REPLACE FUNCTION topn_nogroups( INTEGER ) RETURNS SETOF people
LANGUAGE PLPGSQL AS $$
DECLARE
_nrows ALIAS FOR $1;
_gid INTEGER;
_p people;
BEGIN
-- get lowest gid
_gid := group_id FROM people ORDER BY group_id ASC LIMIT 1;
WHILE _gid IS NOT NULL LOOP
FOR _p IN SELECT * FROM people WHERE group_id=_gid ORDER BY group_id
DESC, score DESC LIMIT _nrows LOOP
RETURN NEXT _p;
END LOOP;
-- get next group_id
_gid := group_id FROM people WHERE group_id > _gid ORDER BY group_id ASC
LIMIT 1;
END LOOP;
RETURN;
END;
$$;

test=# SELECT * FROM topn_nogroups(4) LIMIT 10;
user_id | group_id | score
---------+----------+-------
120224 | 1 | 980
53244 | 1 | 977
101540 | 1 | 971
69530 | 1 | 966
73022 | 2 | 994
52085 | 2 | 992
43474 | 2 | 990
4022 | 2 | 989
94110 | 3 | 982
96491 | 3 | 975
(10 lignes)

test=# EXPLAIN ANALYZE SELECT * FROM topn_nogroups(1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Function Scan on topn_nogroups (cost=0.00..12.50 rows=1000 width=12)
(actual time=152.869..154.049 rows=1000 loops=1)
Total runtime: 154.493 ms

-- Take 3
----------------------------------------------------------------------------------------

If we get accept returning only one row per group we can remove one query
per loop iteration

CREATE OR REPLACE FUNCTION top_nogroups( ) RETURNS SETOF people LANGUAGE
PLPGSQL AS $$
DECLARE
_gid INTEGER;
_p people;
BEGIN
-- get lowest gid
_gid := group_id FROM people ORDER BY group_id DESC LIMIT 1;
IF _gid IS NULL THEN RETURN; END IF;
_gid := _gid + 1;
LOOP
SELECT INTO _p * FROM people WHERE group_id<_gid ORDER BY group_id DESC,
score DESC LIMIT 1;
IF NOT FOUND THEN RETURN; END IF;
_gid = _p.group_id;
RETURN NEXT _p;
END LOOP;
RETURN;
END;
$$;

test=# EXPLAIN ANALYZE SELECT * FROM top_nogroups();
---------------------------------------------------------------------------------------------------------------------
Function Scan on top_nogroups (cost=0.00..12.50 rows=1000 width=12)
(actual time=59.086..60.319 rows=1000 loops=1)
Total runtime: 60.712 ms
(2 lignes)

Performance is about the same level as the first query with the subselect
(even though the subselect is executed twice) BUT you don't need the
groups table. You might like that.

I just love postgres. Think you have a problem, it has an elegant
solution. It even has some features it doesn't have, you just have to look
under the carpet and you find them (like the index skip scan).

Anyway, it was fun to experiment with that !

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Michael Fuhr 2005-02-08 03:31:41 Re: the best way to get the first record from each group
Previous Message Premsun Choltanwanich 2005-02-08 01:32:33 Re: How can I use large object on PostgreSQL Linux