Re: unnecessary sort in the execution plan when doing group by

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: "Huang, Suya" <Suya(dot)Huang(at)au(dot)experian(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: unnecessary sort in the execution plan when doing group by
Date: 2014-10-28 07:06:43
Message-ID: CAApHDvofgjgP_O97-LL-KoqQeAxxvuk7dzcG_Xc1SUUoBFeVJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Oct 28, 2014 at 7:26 PM, Huang, Suya <Suya(dot)Huang(at)au(dot)experian(dot)com>
wrote:

> Hi,
>
>
>
> This is the Greenplum database 4.3.1.0.
>
>
Likely this is the wrong place to ask for help. The plan output that you've
pasted below looks very different to PostgreSQL's EXPLAIN output.

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Gather Motion 24:1 (slice2; segments: 24) (cost=31286842.08..31287447.81
rows=1683 width=536)

Rows out: 15380160 rows at destination with 14860 ms to first row,
23856 ms to end, start offset by 104 ms.

-> HashAggregate (cost=31286842.08..31287447.81 rows=1683 width=536)

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Gather Motion 24:1 (slice2; segments: 24)
(cost=152269717.33..157009763.41 rows=1196982 width=568)

Rows out: 15380160 rows at destination with 35320 ms to first row,
70091 ms to end, start offset by 102 ms.

-> GroupAggregate (cost=152269717.33..157009763.41 rows=1196982
width=568)

Most likely the reason you're getting the difference in plan is because the
planner is probably decided that there will be too many hash entries for a
hash table based on the 3 grouping columns... Look at the estimates, 1683 with
2 columns and 1196982 with the 3 columns. If those estimates turned out to
be true, then the hash table for 3 columns will be massively bigger than it
would be with 2 columns. With PostgreSQL you might see the plan changing if
you increased the work_mem setting. For greenplum, I've no idea if that's
the same.

Databases are often not very good at knowing with the number of distinct
values would be over more than 1 column. Certain databases have solved this
with multi column statistics, but PostgreSQL does not have these. Although
I just noticed last night that someone is working on them.

Regards

David Rowley

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message jmcdonagh 2014-10-28 20:55:39 Incredibly slow restore times after 9.0>9.2 upgrade
Previous Message Huang, Suya 2014-10-28 06:26:48 unnecessary sort in the execution plan when doing group by