Re: Final Patch for GROUPING SETS

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Subject: Re: Final Patch for GROUPING SETS
Date: 2014-12-11 23:30:15
Message-ID: 8761dhu9ts.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> What that code does is produce plans that look like this:

>> GroupAggregate
>> -> Sort
>> -> ChainAggregate
>> -> Sort
>> -> ChainAggregate

>> in much the same way that WindowAgg nodes are generated.

Tom> That seems pretty messy, especially given your further comments
Tom> that these plan nodes are interconnected and know about each
Tom> other (though you failed to say exactly how).

I'd already explained in more detail way back when we posted the
patch. But to reiterate: the ChainAggregate nodes pass through their
input data unchanged, but on group boundaries they write aggregated
result rows to a tuplestore shared by the whole chain. The top node
returns the data from the tuplestore after its own output is
completed.

The chain_head pointer in the ChainAggregate nodes is used for:

- obtaining the head node's targetlist and qual, to use to project
rows into the tuplestore (the ChainAggregate nodes don't do
ordinary projection so they have dummy targetlists like the Sort
nodes do)

- obtaining the pointer to the tuplestore itself

- on rescan without parameter change, to inform the parent node
whether or not the child nodes are also being rescanned (since
the Sort nodes may or may not block this)

Tom> The claimed analogy to WindowAgg therefore seems bogus since
Tom> stacked WindowAggs are independent, AFAIR anyway.

The analogy is only in that they need to see the same input rows but
in different sort orders.

Tom> I'm also wondering about performance: doesn't this imply more
Tom> rows passing through some of the plan steps than really
Tom> necessary?

There's no way to cut down the number of rows seen by intermediate
nodes unless you implement (and require) associative aggregates, which
we do not do in this patch (that's left for possible future
optimization efforts). Our approach makes no new demands on the
implementation of aggregate functions.

Tom> Also, how would this extend to preferring hashed aggregation in
Tom> some of the grouping steps?

My suggestion for extending it to hashed aggs is: by having a (single)
HashAggregate node keep multiple hash tables, per grouping set, then
any arbitrary collection of grouping sets can be handled in one node
provided that memory permits and no non-hashable features are used.
So the normal plan for CUBE(a,b) under this scheme would be just:

HashAggregate
Grouping Sets: (), (a), (b), (a,b)
-> (input path in unsorted order)

If a mixture of hashable and non-hashable data types are used, for
example CUBE(hashable,unhashable), then a plan of this form could be
constructed:

HashAggregate
Grouping Sets: (), (hashable)
-> ChainAggregate
Grouping Sets: (unhashable), (unhashable,hashable)
-> (input path sorted by (unhashable,hashable))

Likewise, plans of this form could be considered for cases like
CUBE(low_card, high_card) where hashed grouping on high_card would
require excessive memory:

HashAggregate
Grouping Sets: (), (low_card)
-> ChainAggregate
Grouping Sets: (high_card), (high_card, low_card)
-> (input path sorted by (high_card, low_card))

Tom> ISTM that maybe what we should do is take a cue from the SQL
Tom> spec, which defines these things in terms of UNION ALL of
Tom> plain-GROUP-BY operations reading from a common CTE.

I looked at that, in fact that was our original plan, but it became
clear quite quickly that it was not going to be easy.

I tried two different approaches. First was to actually re-plan the
input (i.e. running query_planner more than once) for different sort
orders; that crashed and burned quickly thanks to the extent to which
the planner assumes that it'll be run once only on any given input.

Second was to generate a CTE for the input data. This didn't get very
far because everything that already exists to handle CTE nodes assumes
that they are explicit in the planner's input (that they have their
own Query node, etc.) and I was not able to determine a good solution.
If you have any suggestions for how to approach the problem, I'm happy
to have another go at it.

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-12-11 23:35:55 Re: Commitfest problems
Previous Message Simon Riggs 2014-12-11 23:00:22 Re: Proposal : REINDEX SCHEMA