Re: Final Patch for GROUPING SETS

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Svenne Krap <svenne(at)krap(dot)dk>
Subject: Re: Final Patch for GROUPING SETS
Date: 2015-04-30 04:35:26
Message-ID: 87oam6tjux.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Andres" == Andres Freund <andres(at)anarazel(dot)de> writes:

Andres> This is not a real review. I'm just scanning through the
Andres> patch, without reading the thread, to understand if I see
Andres> something "worthy" of controversy. While scanning I might have
Andres> a couple observations or questions.

>> + * A list of grouping sets which is structurally equivalent to a ROLLUP
>> + * clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
>> + * ordered data. We do this by keeping a separate set of transition values
>> + * for each grouping set being concurrently processed; for each input tuple
>> + * we update them all, and on group boundaries we reset some initial subset
>> + * of the states (the list of grouping sets is ordered from most specific to
>> + * least specific). One AGG_SORTED node thus handles any number of grouping
>> + * sets as long as they share a sort order.

Andres> Found "initial subset" not very clear, even if I probably
Andres> guessed the right meaning.

How about:

* [...], and on group boundaries we reset those states
* (starting at the front of the list) whose grouping values have
* changed (the list of grouping sets is ordered from most specific to
* least specific). One AGG_SORTED node thus handles any number [...]
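To illustrate the idea (a rough Python sketch, nothing like the actual
executor code; all names here are invented for the example): we keep one
transition state per grouping set, and on each group boundary we reset
exactly those states, starting at the front of the list, whose grouping
values changed:

```python
# Illustrative sketch only: single-pass aggregation over sorted input for
# rollup-compatible grouping sets, e.g. (a,b,c), (a,b), (a), ().
# All names (rollup_count etc.) are invented; this is not PostgreSQL code.

def rollup_count(rows, grouping_sets):
    """Count rows per group for each grouping set in one pass over 'rows',
    which must be sorted on the columns of the most specific set.
    grouping_sets is ordered most specific -> least specific."""
    states = [{} for _ in grouping_sets]   # current key + transition value
    results = []

    for row in rows:
        for i, cols in enumerate(grouping_sets):
            k = tuple(row[c] for c in cols)
            st = states[i]
            if st and st['key'] != k:
                # Group boundary for this set: emit and reset this state.
                # Since sets are ordered most -> least specific, a change
                # here implies all more specific sets changed too.
                results.append((cols, st['key'], st['count']))
                states[i] = st = {}
            if not st:
                st['key'], st['count'] = k, 0
            st['count'] += 1

    for cols, st in zip(grouping_sets, states):   # flush final groups
        if st:
            results.append((cols, st['key'], st['count']))
    return results
```

One scan thus serves any number of grouping sets that share the sort
order, which is the whole point of the AGG_SORTED arrangement.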

>> + * To handle multiple grouping sets that _don't_ share a sort order, we use
>> + * a different strategy. An AGG_CHAINED node receives rows in sorted order
>> + * and returns them unchanged, but computes transition values for its own
>> + * list of grouping sets. At group boundaries, rather than returning the
>> + * aggregated row (which is incompatible with the input rows), it writes it
>> + * to a side-channel in the form of a tuplestore. Thus, a number of
>> + * AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
>> + * "chain head"), which creates the side channel and, when it has returned
>> + * all of its own data, returns the tuples from the tuplestore to its own
>> + * caller.

Andres> This paragraph deserves to be expanded imo.

OK, but what in particular needs clarifying?
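In case it helps, the control flow can be sketched like this
(illustrative Python only, with a plain list standing in for the
tuplestore; none of these names exist in the patch):

```python
# Sketch of an AGG_CHAINED-style pass-through aggregator: it yields its
# input rows unchanged, but appends each finished group's aggregated row
# to a shared 'side_channel' list (standing in for the tuplestore that
# the chain head creates). Invented names; not the executor code.

def chained_count(rows, cols, side_channel):
    cur_key, count = None, 0
    for row in rows:
        k = tuple(row[c] for c in cols)
        if cur_key is not None and k != cur_key:
            side_channel.append((cols, cur_key, count))  # group boundary
            count = 0
        cur_key = k
        count += 1
        yield row               # pass the input row through unchanged
    if cur_key is not None:
        side_channel.append((cols, cur_key, count))

def chain_head(rows, head_cols, side_channel):
    """AGG_SORTED-style chain head: returns its own aggregated rows
    first, then whatever the chained nodes wrote to the side channel."""
    cur_key, count = None, 0
    for row in rows:            # 'rows' is the chained node's output
        k = tuple(row[c] for c in head_cols)
        if cur_key is not None and k != cur_key:
            yield (head_cols, cur_key, count)
            count = 0
        cur_key = k
        count += 1
    if cur_key is not None:
        yield (head_cols, cur_key, count)
    yield from side_channel     # finally drain the side channel
```

In the real plan a Sort sits between each pair of chained nodes; here,
for simplicity, the chained node computes the grand total (empty
grouping set), which needs no sort of its own.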

>> + * In order to avoid excess memory consumption from a chain of alternating
>> + * Sort and AGG_CHAINED nodes, we reset each child Sort node preemptively,
>> + * allowing us to cap the memory usage for all the sorts in the chain at
>> + * twice the usage for a single node.

Andres> What does resetting 'preemptively' mean?

Plan nodes are normally not reset (in the sense of calling ExecReScan)
just because they finished; rather, the reset happens just before a new
scan starts. Doing the rescan call after all the sorted output has
been read means we discard the data from each sort node as soon as it is
transferred to the next one.

There is a more specific comment in agg_retrieve_chained where this
actually happens.

>> + * From the perspective of aggregate transition and final functions, the
>> + * only issue regarding grouping sets is this: a single call site (flinfo)
>> + * of an aggregate function may be used for updating several different
>> + * transition values in turn. So the function must not cache in the flinfo
>> + * anything which logically belongs as part of the transition value (most
>> + * importantly, the memory context in which the transition value exists).
>> + * The support API functions (AggCheckCallContext, AggRegisterCallback) are
>> + * sensitive to the grouping set for which the aggregate function is
>> + * currently being called.

Andres> Hm. I've seen a bunch of aggreates do this.

Such as? This was discussed about a year ago in the context of WITHIN
GROUP:

http://www.postgresql.org/message-id/87r424i24w.fsf@news-spur.riddles.org.uk
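The Python analogy of the problem (invented names, obviously not the C
API) is an aggregate transition function that caches per-state data on
the shared call-site object:

```python
# Analogy of the flinfo-caching pitfall: one "call site" object is used
# to advance several transition states in turn, so anything cached on
# the call site that logically belongs to one state corrupts the others.
# All names here are invented for illustration.

class CallSite:
    def __init__(self):
        self.cached_scratch = None   # per-call-site cache: the bug

def bad_transfn(flinfo, state, value):
    # Wrong: keeps a running list on the call site, so every grouping
    # set's "state" ends up sharing (and clobbering) the same data.
    if flinfo.cached_scratch is None:
        flinfo.cached_scratch = []
    flinfo.cached_scratch.append(value)
    return len(flinfo.cached_scratch)

def good_transfn(flinfo, state, value):
    # Right: everything that belongs to the transition value lives in
    # the transition value itself.
    return (state or 0) + 1
```

Advance two states alternately through the same call site and the "bad"
version counts every input twice per state; the "good" version does not.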

>> + * TODO: AGG_HASHED doesn't support multiple grouping sets yet.

Andres> Are you intending to resolve this before an eventual commit?

Original plan was to tackle AGG_HASHED after a working implementation
was committed; we figured that we'd have two commitfests to get the
basics right, and then have a chance to get AGG_HASHED done for the
third one. Also, there was talk of other people working on hashagg
memory usage issues, and we didn't want to conflict with that.

Naturally the extended delays rather put paid to that plan. Going ahead
and writing code for AGG_HASHED anyway wasn't really an option, since
with the overall structural questions unresolved there was too much
chance of it being wasted effort.

Andres> Possibly after the 'structural' issues are resolved? Or do you
Andres> think this can safely be put off for another release?

I think the feature is useful even without AGG_HASHED, even though that
means it can sometimes be beaten on performance by using UNION ALL of
many separate GROUP BYs; but I'd defer to the opinions of others on that
point.
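For comparison, what "teaching HashAgg to do multiple grouping sets"
buys is a single scan of the input maintaining one hash table per set,
rather than one rescan per set as in the UNION ALL workaround. A toy
Python sketch (invented names, nothing to do with the real executor):

```python
# Illustrative only: multi-grouping-set hash aggregation. One pass over
# the input maintains one hash table per grouping set, where the
# UNION ALL workaround would scan the input once per set.

from collections import defaultdict

def hash_count_multi(rows, grouping_sets):
    tables = [defaultdict(int) for _ in grouping_sets]
    for row in rows:                         # single scan of the input
        for cols, table in zip(grouping_sets, tables):
            table[tuple(row[c] for c in cols)] += 1
    return {cols: dict(t) for cols, t in zip(grouping_sets, tables)}
```

The trade-off Andres asks about is exactly this: per-set HashAgg nodes
need no new executor code, but cost one input scan per set and hold all
the hash tables anyway.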

Andres> Maybe it's just me, but I get twitchy if I see a default being
Andres> used like this. I'd much, much rather see the two remaining
Andres> AGG_* types and get a warning from the compiler if a new one is
Andres> added.

Meh. It also needs a bogus initialization of "result" to avoid compiler
warnings if done that way.

Andres> I'll bet this will look absolutely horrid after a pgindent run :/

pgindent doesn't touch it, I just checked.

[making CUBE and ROLLUP work without being reserved]

Andres> This is somewhat icky. I've not really thought about this very
Andres> much, but it's imo something to pay attention to.

This one was discussed in December or so - all the arguments were
thrashed out then.

The bottom line is that reserving "cube" is really painful due to
contrib/cube, and of the possible workarounds, using precedence rules to
resolve it in the grammar is already being used for some other
constructs.

Andres> So, having quickly scanned through the patch, do I understand
Andres> correctly that the contentious problems are:

Andres> * Arguably this converts the execution *tree* into a DAG. Tom
Andres> seems to be rather uncomfortable with that. I am wondering
Andres> whether this really is a big deal - essentially this only
Andres> happens in a relatively 'isolated' part of the tree right?
Andres> I.e. if those chained together nodes were considered one node,
Andres> there would not be any loops? Additionally, the way
Andres> parametrized scans work already essentially "violates" the
Andres> tree paradigm somewhat.

The major downsides as I see them with the current approach are:

1. It makes plans (and hence explain output) nest very deeply if you
have complex grouping sets (especially cubes with high dimensionality).

This can make explain very slow in the most extreme cases (explain seems
to take about O(N^3) time in the nesting depth of the plan; I don't know
why). But it's important not to exaggerate this effect: if anyone
actually has a real-world example of a 12-d cube I'll eat the headgear
of their choice, and for an 8-d cube the explain overhead is only on the
order of ~40ms. (A 12-d cube generates more than 35 megs of explain
output, nested about 1850 levels deep, and takes about 45 seconds to
explain, though only about 200ms to plan.)

In more realistic cases, the nesting isn't too bad (4-d cube gives a
12-deep plan: 6 sorts and 6 agg nodes), but it's still somewhat less
readable than a union-based plan would be. But honestly I think that
explain output aesthetics should not be a major determining factor for
implementations.

2. A union-based approach would have a chance of including AGG_HASHED
support without any significant code changes, just by using one HashAgg
node per qualifying grouping set. However, this would be potentially
significantly slower than teaching HashAgg to do multiple grouping sets,
and memory usage would be an issue.

(The current approach is specifically intended to allow the use of an
AGG_HASHED node as the head of the chain, once it has been extended to
support multiple grouping sets.)

Andres> There still might be better representations, about which I want
Andres> to think, don't get me wrong. I'm "just" not seeing this as a
Andres> fundamental problem.

The obvious alternative is this:

-> CTE x
-> entire input subplan here
-> Append
-> GroupAggregate
-> Sort
-> CTE Scan x
-> GroupAggregate
-> Sort
-> CTE Scan x
-> HashAggregate
-> CTE Scan x
[...]

which was basically what we expected to do originally. But all of the
existing code to deal with CTE / CTEScan is based on the assumption that
each CTE has a rangetable entry before planning starts, and it is
completely non-obvious how to arrange to generate such CTEs on the fly
while planning. Tom offered in December to look into that aspect for
us, and of course we've heard nothing about it since.

Andres> * The whole grammar/keyword issue. To me this seems to be a
Andres> problem of swallowing one out of several similarly coloured
Andres> poisonous pills.

Right. Which is why this issue was thrashed out months ago and the
current approach decided on. I consider this question closed.

Andres> Are those the two bigger controversial areas? Or are there
Andres> others in your respective views?

Another controversial item was the introduction of GroupedVar. The need
for this can be avoided by explicitly setting to NULL the relevant
columns of the representative group tuple when evaluating result rows,
but (a) I don't think that's an especially clean approach (though I'm
not pushing back very hard on it) and (b) the logic needed in its
absence is different between the current chaining implementation and a
possible union implementation, so I decided against making any changes
on wasted-effort grounds.
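For reference, the NULLing alternative amounts to something like this
(toy Python, invented names): when projecting a result row for a
grouping set, explicitly replace the columns outside that set with NULL.

```python
# Sketch of the alternative to GroupedVar described above: columns of
# the representative group tuple that are not part of the current
# grouping set are nulled out when the result row is built.
# Invented names; not the patch's code.

def project_result_row(rep_tuple, all_cols, grouping_set):
    """rep_tuple: representative group tuple as a dict. Columns outside
    the current grouping set become None (standing in for SQL NULL)."""
    return {c: (rep_tuple[c] if c in grouping_set else None)
            for c in all_cols}
```

GroupedVar instead lets the executor fetch the value-or-NULL directly,
without rewriting the tuple; the sketch just shows what the explicit
approach has to do for every result row.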

--
Andrew (irc:RhodiumToad)
