Query Planner Filtering Of Specified Value From 'On Distinct' Clause

From: Alex Satrapa <alex(at)lintelsys(dot)com(dot)au>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Query Planner Filtering Of Specified Value From 'On Distinct' Clause
Date: 2003-11-17 08:55:50
Message-ID: 3FB88D16.30007@lintelsys.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I have a table from which I'm trying to extract certain information. For
historical reasons, we archive every action on a particular thing
('thing' is identified, funnily enough, by 'id'). So the only way to
find out the current state of a particular combination of attributes is
to "select distinct on (id, ...) ... order by date desc".

In the examples below, I've taken real output from psql and done a
global search/replace on various names to convince myself that I'm not
leaking information. The numbers are intact, just the column/table/view
names are different.

For sake of argument, here is the table structure:
database=> \d souce
Table "public.source"
Column | Type | Modifiers
---------+-----------------------------+---------------
id | integer | not null
date | timestamp without time zone | default 'now'
second | integer | not null
third | text | not null
Indexes: source btree (id)

This table contains about 98000 rows. Let's create a view into the
"source" table:

database=> create view myview as select distinct
on (id, second, third) *
from source
order by id, second, third, date desc;
CREATE VIEW

Now let's look at the query plan for "select * from myview":

database=> explain analyze select * from myview;
QUERY PLAN
--------------------------------------------------------------
Subquery Scan myview (cost=12483.55..13463.96 rows=9804 width=63)
(actual time=1262.08..1450.68 rows=21089 loops=1)
-> Unique (cost=12483.55..13463.96 rows=9804 width=63)
(actual time=1262.06..1406.71 rows=21089 loops=1)
-> Sort (cost=12483.55..12728.65 rows=98041 width=63)
(actual time=1262.06..1339.34 rows=98063 loops=1)
Sort Key: source.id, source.second, source."third",
source.date
-> Seq Scan on source (cost=0.00..2247.41 rows=98041
width=63) (actual time=0.02..269.07 rows=98063 loops=1)
Total runtime: 1467.78 msec
(6 rows)

I'm not sure if there's really anything I can do to make that go faster.
But look what happens when I specify a particular value for one of the
fields in the "on distinct" clause:

database=> explain analyze select * from myview where id=12345;
QUERY PLAN
--------------------------------------------------------------
Subquery Scan myview (cost=12483.55..13463.96 rows=9804 width=63)
(actual time=1446.58..1446.81 rows=2 loops=1)
Filter: (id = 12345)
-> Unique (cost=12483.55..13463.96 rows=9804 width=63)
(actual time=1258.87..1434.39 rows=21089 loops=1)
-> Sort (cost=12483.55..12728.65 rows=98041 width=63)
(actual time=1258.87..1336.89 rows=98063 loops=1)
Sort Key: source.id, source.second, source."third",
source.date
-> Seq Scan on source (cost=0.00..2247.41 rows=98041 width=63)
(actual time=0.02..268.54 rows=98063 loops=1)
Total runtime: 1457.53 msec
(7 rows)

Hmm.. not as efficient as I would have liked. The planner happily
executed the whole view, and only then filtered out just the bits that I
wanted. The most expensive parts of the query were done on the
unfiltered set. The result set is only 0.002% of the whole data set, so
I was a little shocked that it took *longer* to generate the results
that I wanted, than it took to create the unfiltered results!

Is there any reason why it would not be sensible to push the filter down
to just after the first sequential scan of source? Since "id" is one of
the fields in the "on distinct" condition, doesn't it make sense that
pushing the filter further down would reduce the amount of work done to
get exactly the same result? The same would be true if I was to filter
on a specific "second" or "third" value.

I understand that filtering on columns not present in the "on distinct"
clause would have drastically different effects when done before or
after the "Unique" processing. However, since the column(s) I'm
filtering on are only those present in the "on distinct" clause, it
makes no difference whether the filter is before or after the
"Unique"-ifying.

Am I totally stark, raving mad?
Alex Satrapa

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Fajar Priyanto 2003-11-17 09:12:14 Newbie: port
Previous Message Keith C. Perry 2003-11-17 03:43:08 Re: Retoring non-administrative user databases