Re: The Future of Aggregation

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: The Future of Aggregation
Date: 2015-06-09 16:19:59
Message-ID: 5577122F.9030101@2ndquadrant.com
Lists: pgsql-hackers

On 06/09/15 17:27, Andres Freund wrote:
> On 2015-06-09 17:19:33 +0200, Tomas Vondra wrote:
>> ... and yet another use case for 'aggregate state combine' that I
>> just remembered about is grouping sets. What GROUPING SET (ROLLUP,
>> ...) do currently is repeatedly sorting the input, once for each
>> grouping.
>
> Actually, that's not really what happens. All aggregates that share
> a sort order are computed in parallel. Only when sets do not share
> an order additional sorts are required.

Oh, right, that's what I meant, but failed to explain clearly.

>> What could happen in some cases is building the most detailed
>> aggregation first, then repeatedly combine these partial states.
>
> I'm not sure that'll routinely be beneficial, because it'd require
> keeping track of all the individual "most detailed" results, no?

Yes, it requires tracking all the "detailed" aggregate states. I'm not
claiming this is beneficial in every case - sometimes the current sort
approach will be better, sometimes the combine approach will be faster.
In a sense it's similar to GroupAggregate vs. HashAggregate.

I expect this 'combine' approach will be much faster in cases with many
source rows but a number of groups small enough that the detailed states
fit into work_mem. In that case you can do hashagg and then walk through
the hash table to build the actual results. This entirely eliminates the
expensive sorts, which are killing us in many DWH workloads (because
real-world queries usually produce only a few rows, even on very large
data sets).
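
To illustrate what I mean, here's a toy sketch in Python (using sum() as
the aggregate and made-up data, so purely illustrative, nothing like the
actual executor code): compute the most detailed grouping once with a
hash table, then derive the coarser ROLLUP levels just by combining the
partial states - no extra sorts, no second pass over the source rows.

    from collections import defaultdict

    # Toy data: (a, b, x) rows, aggregate is sum(x); names are made up.
    rows = [("east", "jan", 10), ("east", "feb", 20),
            ("west", "jan", 5), ("west", "feb", 7)]

    # Transition and combine functions for sum(); any aggregate with a
    # sane "combine two partial states" operation works the same way.
    def trans(state, value):
        return state + value

    def combine(s1, s2):
        return s1 + s2

    # Step 1: hash-aggregate the most detailed grouping, GROUP BY (a, b).
    detailed = defaultdict(int)
    for a, b, x in rows:
        detailed[(a, b)] = trans(detailed[(a, b)], x)

    # Step 2: build the coarser ROLLUP levels by combining the detailed
    # partial states, without touching the source rows again.
    by_a = defaultdict(int)
    total = 0
    for (a, b), state in detailed.items():
        by_a[a] = combine(by_a[a], state)
        total = combine(total, state)

    print(dict(detailed))  # GROUP BY (a, b)
    print(dict(by_a))      # GROUP BY (a)
    print(total)           # grand total ()

The point being that once the detailed states fit into work_mem, every
coarser grouping is just a cheap walk over that hash table.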

But ISTM this might help even in cases where the detailed states don't
fit into memory, still assuming the number of groups is much smaller
than the number of source rows. Just do "partial aggregation" by
aggregating the source rows (using hashagg) until you fill work_mem.
Then either dump all the aggregate states to disk, or only some of them
(least frequently used?), and continue. Then you can sort the states,
and assuming it's a much smaller amount of data, it'll be much faster
than sorting all the rows. And you can do the grouping sets using
multiple sorts, just like today.
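
Roughly like this, in the same toy Python terms (sum() again, and an
arbitrary group limit standing in for work_mem, so this is only a sketch
of the idea, not a proposal for the actual code):

    from collections import defaultdict
    from itertools import groupby

    WORK_MEM_GROUPS = 2   # made-up stand-in for work_mem

    def partial_aggregate(rows):
        """Hash-aggregate until the table is 'full', then spill its states."""
        spilled = []                  # partial states "written to disk"
        table = defaultdict(int)
        for key, x in rows:
            table[key] += x
            if len(table) > WORK_MEM_GROUPS:
                spilled.extend(table.items())   # dump all states
                table.clear()
        spilled.extend(table.items())
        return spilled

    def merge_states(spilled):
        """Sort the (few) partial states by key and combine equal keys."""
        spilled.sort(key=lambda kv: kv[0])
        return {key: sum(state for _, state in grp)
                for key, grp in groupby(spilled, key=lambda kv: kv[0])}

    rows = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5), ("a", 6)]
    print(merge_states(partial_aggregate(rows)))  # {'a': 11, 'b': 7, 'c': 3}

The sort in merge_states() only sees the spilled partial states, which
(assuming the partial aggregation does its job) is a much smaller amount
of data than the source rows.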

Of course, this only works if the partial aggregation actually reduces
the amount of data spilled to disk - if the aggregate states grow fast,
or if the tuples arrive in an unfortunate order, it may get ugly.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
