Re: Planning aggregates which require sorted or distinct

From: Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning aggregates which require sorted or distinct
Date: 2007-01-20 12:54:51
Message-ID: Pine.LNX.4.58.0701202319320.29254@linuxworld.com.au
Lists: pgsql-hackers

On Sat, 20 Jan 2007, Simon Riggs wrote:

> On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote:
> > On Fri, 19 Jan 2007, Tom Lane wrote:
> >
> > > Gavin Sherry <swm(at)alcove(dot)com(dot)au> writes:
> > > > On Fri, 19 Jan 2007, Tom Lane wrote:
> > > >> Er, what primary key would that be exactly? And even if you had a key,
> > > >> I wouldn't call joining on it trivial; I'd call it expensive ...
> > >
> > > > I should have used slightly different language. What I meant to say was,
> > > > both sets are primarily sorted by saledate so they can be merged back
> > > > together. This is why I said it was trivial.
>
> Yes, your fan out plan sounds best, together with the optimisation to
> remove whatever you call the individual strands of the fan-out. Think of
> a good name now, so we can discuss it more easily...

Hah. Yes. I've been using the term 'subplan' but it isn't a good one. Can
we just choose something 'cool' like iPlan2006? ;)

> > Yep. I was thinking about this all morning. I think I've over engineered
> > the problem in my head. Window function input just looks like a slightly
> > more complex distinct aggregate input. I'll think on it more though.
>
> I'm working on modifying Tuplestore that will allow you to store the
> last N tuples, rather than all previous input. This is specifically
> designed to allow Merge Join to do Mark/Restore without materializing
> the whole sort set. This can also be used to materialize Windows (i.e.
> <window clause> in SQL:2003), so you can store the current row plus n
> PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the
> Window would then be similar-ish to doing a Mark/Restore pair, which we
> can rename to MarkWindowStart and ReturnToWindowStart.

Wow. What a coincidence! Windows are slightly more complex though. As you
probably know, there are two ways of specifying the window frame: by an
absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
(RANGE N PRECEDING), where the range, in the case of 'preceding', is
determined by subtracting the range parameter from the value of the current
field -- i.e., the window attribute.
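
For illustration, using the employees table from the example further down
in this mail, the two frame types would look something like this (SQL:2003
syntax only; obviously nothing we can run today, and the 1000 is just an
arbitrary offset):

  -- ROWS: a fixed-size frame, i.e. the current row plus at most the two
  -- physically preceding rows, whatever their salary values happen to be.
  SELECT empno,
         avg(salary) OVER (ORDER BY salary ROWS 2 PRECEDING) AS avg_rows
  FROM employees;

  -- RANGE: a value-based frame, i.e. every row whose salary is within
  -- 1000 of the current row's salary.  The frame size depends entirely
  -- on the data distribution.
  SELECT empno,
         avg(salary) OVER (ORDER BY salary RANGE 1000 PRECEDING) AS avg_range
  FROM employees;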

This presents a problem in terms of managing the size of the buffer. If
you have data clustered around a value then the amount of data input into
the window function when we cross this value explodes. The tuplestore code
can (or will) deal with this, but it is currently not designed for this kind
of usage pattern. That is, every time we advance a row we must (a)
recalculate the multiset to be input to the window function and (b) generate
the value from the window function by passing that input to it. The problem
arises when the window contains more data than can be stored in work_mem.
Then, each time we need to recalculate the window function, we'll churn data
through the buffer and get no benefit from the buffering itself.
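
A rough way to see the distribution problem, sticking with the same made-up
example: with ROWS 2 PRECEDING no frame ever holds more than three rows,
but with RANGE 1000 PRECEDING the frame holds however many rows fall into
that salary band:

  -- Purely illustrative: find the most heavily clustered sort-key values.
  -- If thousands of rows sit within one 1000-wide salary band, every RANGE
  -- frame that crosses the band has to buffer all of them, so the frame
  -- can exceed work_mem no matter how generously it is set.
  SELECT salary, count(*) AS rows_at_this_value
  FROM employees
  GROUP BY salary
  ORDER BY count(*) DESC;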

A lot of the research around window functions recommends offsetting this
effect by buffering the data 'pre-aggregated' for each distinct value in the
buffer. Most of that research, however, relies on a trick we don't have
available in the SQL window function spec: windows in SQL have a partition
(irrelevant here), a data sort order and a range; windows in the streaming
data research literature have all of this plus a 'slide'. The slide is the
interval at which the aggregate is regenerated, and in SQL the value is
regenerated for every new row. The research concedes that at this level,
pre-aggregation either stops making sense or is very complex.
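
As a sketch of what that pre-aggregation looks like for a combinable
aggregate (this is the idea from the literature, not what I intend to
propose; employees and the 1000-wide RANGE frame are the same invented
example as above, and 42000 stands in for the current row's salary):

  -- Keep one partial state per distinct sort-key value instead of one
  -- buffer entry per input row; a frame's result is then the combination
  -- of the partial states whose key falls inside the frame.
  SELECT sum(partial_sum) AS frame_sum
  FROM (SELECT salary, sum(salary) AS partial_sum
        FROM employees
        GROUP BY salary) AS pre_agg
  WHERE salary BETWEEN 42000 - 1000 AND 42000;
  -- This works because sum() can be rebuilt from combinable partial
  -- states; the aggregates that cannot are exactly the hard cases
  -- mentioned below.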

I've come up with a way to do it, but not for all aggregates (median(), for
example). I'll discuss this in the proposal to be sent
through soon. The problem is, the requirements we have for buffering data
around window functions could be very complex.

> I'll present the prototype shortly, I've got a few bugs, but the basic
> principle is working already. I'm happy to craft that to your exact
> needs, so that you'll be able to press ahead with the rest of the
> implementation of Windowed functions.

Awesome. I will get the proposal off so that we can discuss the
requirements at length.

> The Materialize node is almost unchanged, but I was thinking of renaming
> it when used in this way to make the EXPLAIN more readable. Perhaps we
> should call it a Materialized Window for both the MJ and Window function
> cases.

I think that would be confusing in the case of MJ.

> This won't help with UNBOUNDED window definitions, but I imagine that
> these are best handled by running aggregates which would be an O(N)
> operation, rather than recalculating everything each time, which would
> be O(N^2).

Correct. You only need to recalculate the aggregate if it has a moving
window frame.
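
For example (illustrative syntax again), an unbounded frame like the one
below only ever grows at the front, so the executor can carry a single
transition state forward and fold each new row into it:

  -- Running total: the frame always starts at the first row, so each new
  -- row is simply added to the existing state.  One pass, O(N), with no
  -- re-aggregation of earlier rows.
  SELECT empno,
         sum(salary) OVER (ORDER BY salary
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW) AS running_total
  FROM employees;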

> > To bring out a slightly different point -- and I know this is putting the
> > cart before the horse -- but window functions will (potentially) output
> > rows in the wrong order. I made a passing reference to this earlier. For
> > example, say we have a table employees with the following data:
> >
> > empno | salary | age
> > ====================
> > 1 | 2000 | 50
> > 2 | 6000 | 30
> > 3 | 3000 | 20
> >
> > We want to answer the following: for each employee: what is their rank in
> > terms of salary and what is their rank in terms of age. This query
> > answers that:
> >
> > select empno, rank() over (order by salary) as srank,
> > rank() over (order by age) as arank
> > from employees order by empno;
> >
> > The result will be:
> >
> > empno | srank | arank
> > ====================
> > 1 | 1 | 3
> > 2 | 3 | 2
> > 3 | 2 | 1
> >
> > Both window functions provide results based on the order of their input.
> > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > arank will output in this order: empno = 3, 2, 1. We need to glue these
> > back together and the only way I can think how is via a synthetic key.
>
> Anything wrong with using empno?

It might not be a unique key.
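
To make the gluing step concrete, here is roughly what the fan-out looks
like written out by hand. The rid column is a hypothetical synthetic key
attached to each input row before the fan-out (producing it is part of the
problem); each strand evaluates one window function in its own sort order,
and the joins back on rid are the glue:

  SELECT e.empno, s.srank, a.arank
  FROM employees e
  JOIN (SELECT rid, rank() OVER (ORDER BY salary) AS srank
        FROM employees) s ON s.rid = e.rid
  JOIN (SELECT rid, rank() OVER (ORDER BY age) AS arank
        FROM employees) a ON a.rid = e.rid
  ORDER BY e.empno;
  -- How the planner should pick nested loop, merge or hash join for those
  -- two joins is exactly the question below.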

>
> > Ideally, the planner would have some clue about how large the result set
> > will be and the sort orders from the window functions, so that it can
> > decide whether to use a nested loop, merge join or hash join to do it.
> >
> > Can you think of a different approach?
>
> Sounds like figuring out and agreeing on the executor issues first is the
> best bet. Once we know what's to be done, extending the planner to do it
> will be easier.

Exactly. As you can see, it's a pretty big topic so I'll work further on
the proposal and send it off.

Thanks,

Gavin
