Re: Optimizer on sort aggregate

From: Feng Tian <ftian(at)vitessedata(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimizer on sort aggregate
Date: 2014-10-18 01:25:14
Message-ID: CAFWGqntWU7+M_qm5ASyypvJbxpOCs5y8DHeA5sQWWSHs28c1Jw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, David,

Yes, switch sorting order would loose an interesting order so if user
dictates order by t, i; planner need to resort to its cost model.
Estimating cardinality of groupby is a much bigger topic than this thread.

I feel sorting string as if it is bytea is particularly interesting. I am
aware Peter G's patch and I think it is great, but for this sort agg case,
first, I believe it is still slower than sorting bytea, and second, Peter
G's patch depends on data. A common long prefix will make the patch less
effective, which is probably not so uncommon (for example, URL with a
domain prefix). I don't see any downside of sort bytea, other than lost
the interest ordering.

Thanks,
Feng

On Fri, Oct 17, 2014 at 4:36 PM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <fengttt(at)gmail(dot)com> wrote:
>
>> Hi,
>>
>> Consider the following queries.
>>
>> create table t(i int, j int, k int, t text);
>> insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from
>> generate_series(1, 1000000) i;
>>
>> ftian=# explain select count(distinct j) from t group by t, i;
>> QUERY PLAN
>> ------------------------------------------------------------------------
>> GroupAggregate (cost=158029.84..178029.84 rows=1000000 width=22)
>> -> Sort (cost=158029.84..160529.84 rows=1000000 width=22)
>> Sort Key: t, i
>> -> Seq Scan on t (cost=0.00..17352.00 rows=1000000 width=22)
>> (4 rows)
>>
>>
>> The query,
>> select count(distinct j) from t group by t, i;
>>
>> runs for 35 seconds. However, if I change the query to,
>> select count(distinct j) from t group by i, t; -- note switching the
>> ordering
>> select count(distinct j) from t group by decode(t, 'escape'), i; --
>> convert t to bytea
>>
>> Run times are just about 5 and 6.5 seconds. The reason is clear, compare
>> a string with collation is slow, which is well understood by pg hackers.
>> However, here, the sorting order is forced by the planner, not user.
>> Planner can do the following optimizations,
>>
>> 1. for the sort we generated for sort agg, planner can switch column
>> ordering, put int before string,
>> 2. for the sort we generated for sort agg, use bytea compare instead of
>> string compare.
>>
>> They will bring big improvement to this common query. Is this something
>> reasonable?
>>
>>
> It seems like it might be worth looking into, but I think it's likely more
> complex than sorting the group by order into narrowest column first.
>
> If the query was:
>
> select count(distinct j) from t group by t, i order by t, i;
>
> Then if that was rewritten to make the group by i,t then the planner
> would then need to perform an additional sort on t,i to get the correct
> order by for the final result, which may or may not be faster, it would
> depend on how many groups there were to sort. Though I guess you could
> possibly just skip the optimisation if the next node up didn't need any
> specific ordering.
>
> I also wonder if taking into account the column's width is not quite
> enough. For example if the 'i' column only had 1 distinct value, then
> performing the group by on 'i' first wouldn't help at all. Keep in mind
> that the columns could be much closer in width than in your example, e.g
> int and bigint. Perhaps you'd need to include some sort of heuristics to
> look at the number of distinct values and the width, and form some sort of
> weighted estimates about which column to put first.
>
> Regards
>
> David Rowley
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-10-18 02:02:04 Re: Directory/File Access Permissions for COPY and Generic File Access Functions
Previous Message Tom Lane 2014-10-18 00:39:22 Re: Hide 'Execution time' in EXPLAIN (COSTS OFF)