From: Andres Freund <andres(at)anarazel(dot)de>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Svenne Krap <svenne(at)krap(dot)dk>
Subject: Re: Final Patch for GROUPING SETS
Date: 2015-05-12 03:36:19
Message-ID: 20150512033619.GD32151@alap3.anarazel.de
Lists: pgsql-hackers
On 2015-04-30 05:35:26 +0100, Andrew Gierth wrote:
> >>>>> "Andres" == Andres Freund <andres(at)anarazel(dot)de> writes:
>
> Andres> This is not a real review. I'm just scanning through the
> Andres> patch, without reading the thread, to understand if I see
> Andres> something "worthy" of controversy. While scanning I might have
> Andres> a couple observations or questions.
>
> >> + * A list of grouping sets which is structurally equivalent to a ROLLUP
> >> + * clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
> >> + * ordered data. We do this by keeping a separate set of transition values
> >> + * for each grouping set being concurrently processed; for each input tuple
> >> + * we update them all, and on group boundaries we reset some initial subset
> >> + * of the states (the list of grouping sets is ordered from most specific to
> >> + * least specific). One AGG_SORTED node thus handles any number of grouping
> >> + * sets as long as they share a sort order.
>
> Andres> Found "initial subset" not very clear, even if I probably
> Andres> guessed the right meaning.
>
> How about:
>
> * [...], and on group boundaries we reset those states
> * (starting at the front of the list) whose grouping values have
> * changed (the list of grouping sets is ordered from most specific to
> * least specific). One AGG_SORTED node thus handles any number [...]
sounds good.
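
(Not from the patch -- a toy Python sketch of the mechanism the comment describes, with invented names like rollup_counts: one pass over data sorted on (a, b, c), keeping a separate transition value per grouping set, updating all of them per input tuple, and resetting/emitting only those whose grouping values changed at a boundary.)

```python
# Hypothetical sketch (not PostgreSQL code): single-pass rollup over rows
# sorted on their first n_keys columns, maintaining one transition state
# (here just a count) per grouping set (a,b,c), (a,b), (a).

def rollup_counts(rows, n_keys=3):
    """Yields (prefix_len, key_values, count) whenever a group closes."""
    sets = list(range(n_keys, 0, -1))    # prefix lengths 3, 2, 1: most -> least specific
    states = {n: 0 for n in sets}        # one transition value per grouping set
    prev = None
    for row in rows:
        if prev is not None:
            for n in sets:
                # Reset (and emit) exactly those states whose grouping
                # values changed; the list runs most -> least specific.
                if row[:n] != prev[:n]:
                    yield (n, prev[:n], states[n])
                    states[n] = 0
        for n in sets:
            states[n] += 1               # every state advances on each input tuple
        prev = row
    if prev is not None:                 # flush all states at end of input
        for n in sets:
            yield (n, prev[:n], states[n])

results = list(rollup_counts(sorted([(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)])))
```

The point being illustrated: a single node handles all three grouping sets because they share the sort order.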
> >> + * To handle multiple grouping sets that _don't_ share a sort order, we use
> >> + * a different strategy. An AGG_CHAINED node receives rows in sorted order
> >> + * and returns them unchanged, but computes transition values for its own
> >> + * list of grouping sets. At group boundaries, rather than returning the
> >> + * aggregated row (which is incompatible with the input rows), it writes it
> >> + * to a side-channel in the form of a tuplestore. Thus, a number of
> >> + * AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
> >> + * "chain head"), which creates the side channel and, when it has returned
> >> + * all of its own data, returns the tuples from the tuplestore to its own
> >> + * caller.
>
> Andres> This paragraph deserves to be expanded imo.
>
> OK, but what in particular needs clarifying?
I'm not sure ;). I obviously was a bit tired...
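
(Again a hypothetical Python sketch, not the patch's C code, of the side-channel arrangement the quoted comment describes: a chained node returns its input rows unchanged, accumulates its own grouping-set transition values, and pushes aggregated rows -- which have a different shape from the input -- onto a shared list standing in for the tuplestore created by the chain head.)

```python
# Hypothetical model of an AGG_CHAINED node: pass rows through unchanged,
# and on group boundaries write aggregated rows to the side channel.

def chained_agg(rows, key_cols, side_channel):
    state, prev_key = 0, None
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if prev_key is not None and key != prev_key:
            side_channel.append((key_cols, prev_key, state))
            state = 0
        state += 1
        prev_key = key
        yield row                        # input rows are returned unchanged
    if prev_key is not None:             # flush the final group
        side_channel.append((key_cols, prev_key, state))

side = []                                # the "tuplestore" the chain head would create
rows = [(1, 'x'), (1, 'y'), (2, 'x')]
# one chained node grouping on column 1, fed rows sorted on that column
passthrough = list(chained_agg(sorted(rows, key=lambda r: r[1]), (1,), side))
```

In the real plan the chain head drains the side channel only after it has returned all of its own rows.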
> Andres> Are you intending to resolve this before an eventual commit?
...
> Andres> Possibly after the 'structural' issues are resolved? Or do you
> Andres> think this can safely be put off for another release?
>
> I think the feature is useful even without AGG_HASHED, even though that
> means it can sometimes be beaten on performance by using UNION ALL of
> many separate GROUP BYs; but I'd defer to the opinions of others on that
> point.
I agree.
> Andres> * Arguably this converts the execution *tree* into a DAG. Tom
> Andres> seems to be rather uncomfortable with that. I am wondering
> Andres> whether this really is a big deal - essentially this only
> Andres> happens in a relatively 'isolated' part of the tree right?
> Andres> I.e. if those chained together nodes were considered one node,
> Andres> there would not be any loops? Additionally, the way
> Andres> parameterized scans work already essentially "violates" the
> Andres> tree paradigm somewhat.
>
> The major downsides as I see them with the current approach are:
>
> 1. It makes plans (and hence explain output) nest very deeply if you
> have complex grouping sets (especially cubes with high dimensionality).
That doesn't concern me overly much. If we feel the need to fudge the
explain output we certainly can.
> 2. A union-based approach would have a chance of including AGG_HASHED
> support without any significant code changes, just by using one HashAgg
> node per qualifying grouping set. However, this would be potentially
> significantly slower than teaching HashAgg to do multiple grouping sets,
> and memory usage would be an issue.
Your "however" imo pretty much disqualifies that as an argument.
> The obvious alternative is this:
>
> -> CTE x
> -> entire input subplan here
> -> Append
> -> GroupAggregate
> -> Sort
> -> CTE Scan x
> -> GroupAggregate
> -> Sort
> -> CTE Scan x
> -> HashAggregate
> -> CTE Scan x
> [...]
>
> which was basically what we expected to do originally. But all of the
> existing code to deal with CTE / CTEScan is based on the assumption that
> each CTE has a rangetable entry before planning starts, and it is
> completely non-obvious how to arrange to generate such CTEs on the fly
> while planning. Tom offered in December to look into that aspect for
> us, and of course we've heard nothing about it since.
I find Noah's argument about this kind of structure pretty
convincing. We'd either increase the number of reads, or balloon the
amount of memory, if we managed to find a structure that allowed several
of the aggregates to be computed at the same time.
Looking at this some more, I do think the current structure makes sense.
I do think we could flatten this into the toplevel aggregation node, but
the increase in complexity doesn't seem to have corresponding benefits
to me.
> Andres> Are those the two bigger controversial areas? Or are there
> Andres> others in your respective views?
>
> Another controversial item was the introduction of GroupedVar. The need
> for this can be avoided by explicitly setting to NULL the relevant
> columns of the representative group tuple when evaluating result rows,
> but (a) I don't think that's an especially clean approach (though I'm
> not pushing back very hard on it) and (b) the logic needed in its
> absence is different between the current chaining implementation and a
> possible union implementation, so I decided against making any changes
> on wasted-effort grounds.
Seems like a fairly minor point to me. I very tentatively lean towards
setting the columns in the group tuple to NULL.
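
(A toy sketch, not anything in the patch, of the GroupedVar-free alternative being leaned towards here: build each result row by copying the representative group tuple and setting to NULL every grouping column that is not part of the grouping set being emitted. The function name result_row is invented.)

```python
# Hypothetical illustration of NULLing the representative group tuple
# instead of introducing a GroupedVar node.

def result_row(rep_tuple, grouping_cols, active_set):
    """rep_tuple: representative input tuple for the group.
    grouping_cols: column positions appearing in any grouping set.
    active_set: column positions in the grouping set being emitted."""
    row = list(rep_tuple)
    for col in grouping_cols:
        if col not in active_set:
            row[col] = None              # this column is aggregated over in this set
    return tuple(row)

# For CUBE(a, b) over representative tuple (a=1, b=2, payload=99),
# emitting the (a) grouping set nulls out b:
r = result_row((1, 2, 99), grouping_cols={0, 1}, active_set={0})
```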
I've rebased the patch to
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
branch int/grouping_sets . There were some minor conflicts.
What I dislike so far:
* Minor formatting things. Just going to fix and push the ones I
dislike.
* The Hopcroft-Karp stuff not being separate
* The increased complexity of grouping_planner. It'd imo be good if some
of that could be refactored into a separate function. Specifically the
else if (parse->hasAggs || (parse->groupingSets && parse->groupClause))
block.
* I think it'd not hurt to add a rule deparse check for the function in
the GROUPING SETS case. I didn't see one at least.
Do you have some nicer demo data set you worked with during development?
FWIW, expensive explains seem to be in stack traces like:
#25 0x00000000008a0ebe in get_variable (var=0x2326d90, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#26 0x00000000008a2ff6 in get_rule_expr (node=0x2326d90, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#27 0x00000000008a0ebe in get_variable (var=0x2326338, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#28 0x00000000008a2ff6 in get_rule_expr (node=0x2326338, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#29 0x00000000008a0ebe in get_variable (var=0x23258b0, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#30 0x00000000008a2ff6 in get_rule_expr (node=0x23258b0, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#31 0x00000000008a0ebe in get_variable (var=0x2324e58, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#32 0x00000000008a2ff6 in get_rule_expr (node=0x2324e58, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#33 0x00000000008a0ebe in get_variable (var=0x2324400, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#34 0x00000000008a2ff6 in get_rule_expr (node=0x2324400, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#35 0x00000000008a0ebe in get_variable (var=0x2323978, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:5813
#36 0x00000000008a2ff6 in get_rule_expr (node=0x2323978, context=0x7ffcd4b386a0, showimplicit=1 '\001')
at /home/andres/src/postgresql/src/backend/utils/adt/ruleutils.c:6933
#37 0x00000000008a0ebe in get_variable (var=0x2322f20, levelsup=0, istoplevel=0 '\000', context=0x7ffcd4b386a0)
- several thousand frames deep. Something seems off here. It's all
below show_grouping_set_keys(), which in turn is below a couple
ExplainNode()s.
I think the problem is "just" that for each variable, in each grouping
set - a very large number in a large cube - we're recursing through the
whole ChainAggregate tree, as each Var just points to a var one level
lower.
It might be worthwhile to add a little hack that deparses the variables
against the "lowest" relevant node (i.e. the one below the last chain
agg). Since they'll all have the same targetlist that ought to be safe.
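
(A deliberately simplified Python model, not ruleutils code, of why the deparse blows up and why the proposed hack is safe: each Var just references a Var one level lower, so resolving a name walks the whole chain, one step per ChainAggregate; when every level has the same targetlist layout, resolving directly against the lowest node gives the same answer in one step.)

```python
# Hypothetical model of Var resolution through a chain of targetlists.

def resolve_walking(chain, names, var_index):
    """chain: list of targetlists, outermost first; each targetlist maps a
    var index to the index it refers to one level down."""
    idx = var_index
    for tlist in chain:
        idx = tlist[idx]                 # one recursion step per chain level
    return names[idx]

def resolve_direct(names, var_index):
    # The shortcut: deparse against the lowest node directly. Safe only
    # because every level shares the same targetlist layout.
    return names[var_index]

chain = [[0, 1], [0, 1], [0, 1]]         # three identity targetlists
names = ['a', 'b']                       # columns of the lowest node
walked = resolve_walking(chain, names, 1)
direct = resolve_direct(names, 1)
```

With thousands of Vars and a deep chain, the walking version does (vars x depth) work, matching the stack traces above.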
I'll look some more, but dinner is calling now.
Greetings,
Andres Freund