queries with DISTINCT / GROUP BY giving different plans

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: queries with DISTINCT / GROUP BY giving different plans
Date: 2013-08-14 15:33:53
Message-ID: 3a56c4d2c0b2daa11a625c2983823901.squirrel@sq.gransy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

I've run into a strange plan difference on 9.1.9 - the first query does
"DISTINCT" by doing a GROUP BY on the columns (both INT).

SELECT
"f_account"."name_id" AS "a_1550",
"f_message"."text_id" AS "a_1562"
FROM "f_accountmessagefact"
INNER JOIN "f_message" ON ( "f_accountmessagefact"."message_id" =
"f_message"."id" )
INNER JOIN "f_account" ON ( "f_accountmessagefact"."account_id" =
"f_account"."id" )
GROUP BY 1, 2;

QUERY PLAN
-----------------------------------------------------------------------------------------
Group (cost=3575011.59..3721066.43 rows=19473978 width=8)
-> Sort (cost=3575011.59..3623696.54 rows=19473978 width=8)
Sort Key: f_account.name_id, f_message.text_id
-> Hash Join (cost=51718.44..1217195.39 rows=19473978 width=8)
Hash Cond: (f_accountmessagefact.account_id = f_account.id)
-> Hash Join (cost=51699.42..949409.18 rows=19473978
width=8)
Hash Cond: (f_accountmessagefact.message_id =
f_message.id)
-> Seq Scan on f_accountmessagefact
(cost=0.00..435202.78 rows=19473978 width=8)
-> Hash (cost=37002.52..37002.52 rows=1175752 width=8)
-> Seq Scan on f_message (cost=0.00..37002.52
rows=1175752 width=8)
-> Hash (cost=11.23..11.23 rows=623 width=8)
-> Seq Scan on f_account (cost=0.00..11.23 rows=623
width=8)
(12 rows)

Now, this takes ~45 seconds to execute, but after rewriting the query to
use the regular DISTINCT it suddenly switches to HashAggregate with ~1/3
the cost (although it produces the same output, AFAIK), and it executes in
~15 seconds.

SELECT DISTINCT
"f_account"."name_id" AS "a_1550",
"f_message"."text_id" AS "a_1562"
FROM "f_accountmessagefact"
INNER JOIN "f_message" ON ( "f_accountmessagefact"."message_id" =
"f_message"."id" )
INNER JOIN "f_account" ON ( "f_accountmessagefact"."account_id" =
"f_account"."id" );

QUERY PLAN
------------------------------------------------------------------------------------------
HashAggregate (cost=1314565.28..1509305.06 rows=19473978 width=8)
-> Hash Join (cost=51718.44..1217195.39 rows=19473978 width=8)
Hash Cond: (f_accountmessagefact.account_id = f_account.id)
-> Hash Join (cost=51699.42..949409.18 rows=19473978 width=8)
Hash Cond: (f_accountmessagefact.message_id = f_message.id)
-> Seq Scan on f_accountmessagefact (cost=0.00..435202.78
rows=19473978 width=8)
-> Hash (cost=37002.52..37002.52 rows=1175752 width=8)
-> Seq Scan on f_message (cost=0.00..37002.52
rows=1175752 width=8)
-> Hash (cost=11.23..11.23 rows=623 width=8)
-> Seq Scan on f_account (cost=0.00..11.23 rows=623 width=8)
(10 rows)

I've tested this with other queries and those actually behave as expected
(both using HashAggregate), so I'm wondering what's wrong with this one
and why it's discarding a plan with much lower cost. Any ideas?

The estimates are quite exact (and exactly the same for both queries).

BTW I can't test this on 9.2 or 9.3 easily, as this is our production
environment and I can't just export the data. I've tried to simulate this
but so far no luck.

regards
Tomas

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2013-08-14 18:35:14 Re: queries with DISTINCT / GROUP BY giving different plans
Previous Message Daniel Cristian Cruz 2013-08-14 12:21:39 Re: Index on a range array