From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
Cc: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> |
Subject: | Re: Final Patch for GROUPING SETS |
Date: | 2014-12-12 21:33:38 |
Message-ID: | 8479.1418420018@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> That seems pretty messy, especially given your further comments
> Tom> that these plan nodes are interconnected and know about each
> Tom> other (though you failed to say exactly how).
> I'd already explained in more detail way back when we posted the
> patch. But to reiterate: the ChainAggregate nodes pass through their
> input data unchanged, but on group boundaries they write aggregated
> result rows to a tuplestore shared by the whole chain. The top node
> returns the data from the tuplestore after its own output is
> completed.
That seems pretty grotty from a performance+memory consumption standpoint.
At peak memory usage, each one of the Sort nodes will contain every input
row, and the shared tuplestore will contain every output row. That will
lead to either a lot of memory eaten, or a lot of temp-file I/O, depending
on how big work_mem is.
In principle, with the CTE+UNION approach I was suggesting, the peak
memory consumption would be one copy of the input rows in the CTE's
tuplestore plus one copy in the active branch's Sort node. I think a
bit of effort would be needed to get there (ie, shut down one branch's
Sort node before starting the next, something I'm pretty sure doesn't
happen today). But it's doable whereas I don't see how we dodge the
multiple-active-sorts problem with the chained implementation.
> Tom> ISTM that maybe what we should do is take a cue from the SQL
> Tom> spec, which defines these things in terms of UNION ALL of
> Tom> plain-GROUP-BY operations reading from a common CTE.
> I looked at that, in fact that was our original plan, but it became
> clear quite quickly that it was not going to be easy.
> I tried two different approaches. First was to actually re-plan the
> input (i.e. running query_planner more than once) for different sort
> orders; that crashed and burned quickly thanks to the extent to which
> the planner assumes that it'll be run once only on any given input.
Well, we'd not want to rescan the input multiple times, so I don't think
that generating independent plan trees for each sort order would be the
thing to do anyway. I suppose ideally it would be nice to check the costs
of getting the different sort orders, so that the one Sort we elide is the
one that gets the best cost savings. But the WindowAgg code isn't that
smart either and no one's really complained, so I think this can wait.
(Eventually I'd like to make such cost comparisons possible as part of the
upper-planner Pathification that I keep nattering about. But it doesn't
seem like a prerequisite for getting GROUPING SETS in.)
> Second was to generate a CTE for the input data. This didn't get very
> far because everything that already exists to handle CTE nodes assumes
> that they are explicit in the planner's input (that they have their
> own Query node, etc.) and I was not able to determine a good solution.
Seems like restructuring that wouldn't be *that* hard. We probably don't
want it to be completely like a CTE for planning purposes anyway --- that
would foreclose passing down any knowledge of desired sort order, which
we don't want. But it seems like we could stick a variant of CtePath
atop the chosen result path of the scan/join planning phase. If you like
I can poke into this a bit.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2014-12-12 21:40:01 | Re: [REVIEW] Re: Compression of full-page-writes |
Previous Message | Jim Nasby | 2014-12-12 21:28:33 | Re: On partitioning |