Best way to get the latest revision from a table

From: Nikolas Everett <nik9000(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Best way to get the latest revision from a table
Date: 2011-01-14 21:17:28
Message-ID: AANLkTikVZG_Us9R9iLdz=GnqqEnHE5D7Go9xkENkdGiQ@mail.gmail.com
Lists: pgsql-performance

I'm using 8.3, and I have a table that contains many revisions of the same
entity. A query against it is super slow. Please help! I'm going to paste in
some SQL to set up test cases, and some plans, below. If that isn't the right
way to post to this list, please let me know and I'll revise.

My table looks kind of like this but wider:
CREATE TEMPORARY TABLE test (revision SERIAL NOT NULL PRIMARY KEY,
    a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO test (a, b, c)
    SELECT a, 1, 25
    FROM generate_series(1, 100000) AS t1(a), generate_series(1, 10) AS t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;
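
That gives 100,000 distinct values of a with 10 revisions each, about a
million rows in total. A quick sanity check on the generated data, in case
anyone wants to reproduce this (not part of the test case itself):

SELECT COUNT(*) AS total_rows, COUNT(DISTINCT a) AS distinct_a FROM test;
-- expect total_rows = 1000000 and distinct_a = 100000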

I need to SELECT all the columns of the latest revision for each a in a subset
of a values. What is the right way to do this quickly?
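
Concretely, for each requested a I want the single row with the highest
revision. A correlated-subquery form expresses what I mean (shown only to pin
down the semantics, not as a performance suggestion; 2..200 matches the
request set used below):

SELECT t.*
FROM test t
WHERE t.a BETWEEN 2 AND 200
  AND t.revision = (SELECT MAX(t2.revision) FROM test t2 WHERE t2.a = t.a);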

When I do it like this:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
FROM request
JOIN (SELECT a, MAX(revision) AS revision FROM test GROUP BY a) max USING (a)
JOIN test USING (revision);
DROP TABLE request;

The plan for the SELECT is pretty bad:
"Hash Join (cost=32792.50..77907.29 rows=62288 width=20) (actual
time=769.570..2222.050 rows=199 loops=1)"
" Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
" -> Hash Join (cost=5.48..38659.23 rows=62288 width=8) (actual
time=20.621..830.235 rows=199 loops=1)"
" Hash Cond: (pg_temp_7.test.a = request.a)"
" -> GroupAggregate (cost=0.00..37170.11 rows=62601 width=8)
(actual time=16.847..808.475 rows=100000 loops=1)"
" -> Index Scan using test_a on test (cost=0.00..31388.04
rows=999912 width=8) (actual time=16.826..569.035 rows=1000000 loops=1)"
" -> Hash (cost=2.99..2.99 rows=199 width=4) (actual
time=3.736..3.736 rows=199 loops=1)"
" -> Seq Scan on request (cost=0.00..2.99 rows=199 width=4)
(actual time=3.658..3.689 rows=199 loops=1)"
" -> Hash (cost=15405.12..15405.12 rows=999912 width=16) (actual
time=723.673..723.673 rows=1000000 loops=1)"
" -> Seq Scan on test (cost=0.00..15405.12 rows=999912 width=16)
(actual time=0.006..290.313 rows=1000000 loops=1)"
"Total runtime: 2222.267 ms"

If I instead issue the query as:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request
    SET revision = (SELECT MAX(revision) FROM test WHERE request.a = test.a);
ANALYZE request;
SELECT *
FROM request
JOIN test USING (revision);
DROP TABLE request;

The whole thing runs tons faster. The UPDATE uses the right index and is way
sub-second, and the SELECT's plan is fine:
"Merge Join (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953
rows=199 loops=1)"
" Merge Cond: (test.revision = request.revision)"
" -> Index Scan using test_pkey on test (cost=0.00..31388.04 rows=999912
width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
" -> Sort (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133
rows=199 loops=1)"
" Sort Key: request.revision"
" Sort Method: quicksort Memory: 34kB"
" -> Seq Scan on request (cost=0.00..3.99 rows=199 width=8) (actual
time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"

Am I missing something, or is this really the best way to do this in 8.3?
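
In case it helps frame the question, a DISTINCT ON version of the same
"latest revision per a" lookup would look something like this (just a sketch,
assuming the request table from the first example is still around; I haven't
timed it):

SELECT DISTINCT ON (test.a) test.*
FROM request
JOIN test USING (a)
ORDER BY test.a, test.revision DESC;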

Thanks for slogging through all this,

Nik Everett
