Quick Links

Rapidly finding maximal rows

From:	James Cranch <jdc41(at)cam(dot)ac(dot)uk>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Rapidly finding maximal rows
Date:	2011-10-11 10:16:06
Message-ID:	Prayer.1.3.4.1110111116060.9639@hermes-2.csi.cam.ac.uk
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

I have a slow query, based on the problem of finding the set of rows which
are maximal in some sense. I expect that there is either a more intelligent
way to write it, or that one could make some indexes that would speed it
up. I've tried various indexes, and am not getting anywhere.

I'd be grateful for any suggestions. Reasonably full details are below.

DESCRIPTION
===========

I am writing a database which stores scores from mass-entry competitions
("challenges"). Candidates who were the best in their school (schools being
identified by their "centre_number") receive a special certificate in each
competition.

Currently the database has data from six competitions, each entered by
something like 200,000 students from about 2,000 schools. I wish to produce
a view showing the students who had best-in-school performances.

I have made two attempts (both immediately after running VACUUM ANALYZE),
and both are surprisingly slow.

THE QUERIES
===========

I'm interested in running something like

SELECT * FROM best_in_school_methodN WHERE competition_name = 'X' AND
academic_year_beginning = 2010

and the following two variants have been tried:

CREATE VIEW best_in_school_method1 AS
WITH best_scores(competition_name, academic_year_beginning,
centre_number, total_score) AS
(SELECT competition_name, academic_year_beginning, centre_number,
MAX(total_score) AS total_score FROM challenge_entries GROUP BY
competition_name, academic_year_beginning, centre_number)
SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, (true) AS best_in_school FROM challenge_entries
NATURAL JOIN best_scores;

This is EXPLAIN ANALYZEd here:
http://explain.depesz.com/s/EiS

CREATE VIEW best_in_school_method2 AS
WITH innertable(competition_name, academic_year_beginning,
centre_number, entry_id, total_score, school_max_score) AS
(SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, MAX(total_score) OVER (PARTITION BY
competition_name, academic_year_beginning, centre_number) AS
centre_max_score FROM challenge_entries)
SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, (true) AS best_in_school FROM innertable WHERE
centre_max_score = total_score;

This one is EXPLAIN ANALYZEd here:
http://explain.depesz.com/s/6Eh

COMMENT
=======

In both cases, unless I've misunderstood, most of the time is taken up by
sorting all the results for that particular competition. It appears to me
that there should be much better ways: the results do not need to be fully
sorted.

If I were such an expert, I wouldn't be asking you all though.

By the way, the choice to SELECT a value of true is so that I can join the
results back into the original table to easily produce a best in school
boolean column.

SCHEMA
======

I should explain the tables(though probably only the last one is
interesting) and the index mentioned by one of the EXPLAINs. They can be
produced by

CREATE TABLE school_years
(
yearname VARCHAR(5) PRIMARY KEY,
minimum_usual_age_september1 INTERVAL,
maximum_usual_age_september1 INTERVAL,
usual_successor VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenge_types
(
competition_name TEXT PRIMARY KEY,
too_young_yeargroup VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenges
(
competition_name TEXT REFERENCES challenge_types,
academic_year_beginning INTEGER,
competition_date DATE,
CONSTRAINT competition_is_in_year CHECK (competition_date BETWEEN (date
(academic_year_beginning || '.001') + interval '9 months') AND (date
(academic_year_beginning || '.001') + interval '21 months')),
CONSTRAINT one_challenge_per_year UNIQUE
(academic_year_beginning,competition_name),
PRIMARY KEY (competition_name, academic_year_beginning)
);

CREATE TABLE challenge_entries
(
entry_id SERIAL,
competition_name TEXT,
academic_year_beginning INTEGER,
given_name TEXT,
surname TEXT,
centre_number CHAR(6),
school_year VARCHAR(5),
date_of_birth DATE,
uk_educated BOOLEAN,
uk_passport BOOLEAN,
sex SEX,
total_score INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (competition_name,academic_year_beginning,entry_id),
FOREIGN KEY (school_year) REFERENCES school_years,
FOREIGN KEY (competition_name,academic_year_beginning) REFERENCES
challenges );

CREATE INDEX challenge_entries_by_competition_centre_number_and_total_score
ON challenge_entries
(competition_name,academic_year_beginning,centre_number,total_score DESC);

SOFTWARE AND HARDWARE
=====================

I'm running "PostgreSQL 8.4.8 on x86_64-pc-linux-gnu, compiled by GCC
gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit". It's the standard
installation from Debian stable (Squeeze), and I haven't messed around with
it.

My Linux kernel is 2.6.32-5-amd64.

I have a desktop PC with a Intel Core i7 CPU and 6GB of RAM, and a single
640GB Hitachi HDT72106 disk. My root partition is less than 30% full.

Responses

Re: Rapidly finding maximal rows at 2011-10-11 23:22:06 from bricklen
Re: Rapidly finding maximal rows at 2011-10-12 01:05:27 from Dave Crooke

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Ivan Voras	2011-10-11 10:20:45	Re: Adding more memory = hugh cpu load
Previous Message	Leonardo Francalanci	2011-10-11 08:57:57	Re: Adding more memory = hugh cpu load