equivalent queries lead to different query plans for self-joins with group by?

From: Ben <midfield(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: equivalent queries lead to different query plans for self-joins with group by?
Date: 2010-11-11 21:52:57
Message-ID: 5D061DAC-31E2-4C47-AE3B-7799118C57D3@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

dear pgers --

consider the following toy example (using pg 8.4.3) :

create temporary table foo (
ts timestamp not null,
id integer not null,
val double precision not null,
primary key (ts, id)
);

i might want to return the vals, minus the averages at each timestamp. the obvious self-join results in a sequential scan over foo -- we aggregate the average val for EVERY timestamp, then join against the timestamps we want.

us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using (ts) where ts > '2010-11-11' and ts < '2010-11-13';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=49.06..54.41 rows=8 width=28)
Hash Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)
-> HashAggregate (cost=34.45..36.95 rows=200 width=16)
-> Seq Scan on foo (cost=0.00..26.30 rows=1630 width=16)
-> Hash (cost=14.51..14.51 rows=8 width=20)
-> Bitmap Heap Scan on foo (cost=4.33..14.51 rows=8 width=20)
Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestamp without time zone))
-> Bitmap Index Scan on foo_pkey (cost=0.00..4.33 rows=8 width=0)
Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestamp without time zone))

on the other hand, if i specify "which" timestamp i'm restricting, it appears to do the right thing:

us_quotedb=# explain select ts, id, val - aval from foo join (select ts, avg(val) as aval from foo group by ts) as a using (ts) where a.ts > '2010-11-11' and a.ts < '2010-11-13';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=18.86..29.14 rows=8 width=28)
-> HashAggregate (cost=14.55..14.56 rows=1 width=16)
-> Bitmap Heap Scan on foo (cost=4.33..14.51 rows=8 width=16)
Recheck Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestamp without time zone))
-> Bitmap Index Scan on foo_pkey (cost=0.00..4.33 rows=8 width=0)
Index Cond: ((ts > '2010-11-11 00:00:00'::timestamp without time zone) AND (ts < '2010-11-13 00:00:00'::timestamp without time zone))
-> Bitmap Heap Scan on foo (cost=4.31..14.45 rows=8 width=20)
Recheck Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)
-> Bitmap Index Scan on foo_pkey (cost=0.00..4.31 rows=8 width=0)
Index Cond: (pg_temp_2.foo.ts = pg_temp_2.foo.ts)

i find this behavior curious. my understanding is that both queries are equivalent, and i would expect that the query planner would be able to choose either of those plans. this is important -- with the real data i'm working with, the table is very large, and the sequential scan is a killer.

are these queries equivalent, or am i mistaken? if the planner distinguishes between these plans, how do i ensure that where clause restrictions propagate (correctly) to subqueries?

best regards, ben

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2010-11-11 22:37:49 Re: equivalent queries lead to different query plans for self-joins with group by?
Previous Message Kenneth Marshall 2010-11-11 21:28:19 Re: anti-join chosen even when slower than old plan