Re: The Future of Aggregation

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "kevin(dot)grittner(at)enterprisedb(dot)com" <kevin(dot)grittner(at)enterprisedb(dot)com>, "amit(dot)kapila(at)enterprisedb(dot)com" <amit(dot)kapila(at)enterprisedb(dot)com>, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Subject: Re: The Future of Aggregation
Date: 2015-06-09 14:52:51
Message-ID: 2120479484.7962669.1433861571649.JavaMail.yahoo@mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:

> It appears to me that there's quite a few new features and
> optimisations on the not too distant horizon which will require
> adding yet more fields into pg_aggregate.
>
> These are things along the lines of:

> 3. Auto-updating Materialized views (ones which contain aggregate
> functions)

Yes, that's certainly on the road map. The recent work to add
support for inverse transition functions already goes a huge way
toward allowing optimization of incremental maintenance of
aggregates in materialized views.

> Items 1-4 above I believe require support of "Aggregate State
> Combine Support" -> https://commitfest.postgresql.org/5/131/

Yeah, that seems likely to extend optimized cases even further.

> 5. Dependant Aggregates

> Item 5 makes items 1-4 a bit more complex as with this item
> there's opportunity for very good performance improvements by
> allowing aggregates like AVG(x) also perform all the required
> work to allow SUM(x) and COUNT(x) to be calculated for "free" in
> a query containing all 3 aggregates.

Not only CPU is saved, but the optimizations for materialized views
would require the aggregate function's transition state to be saved
in each row, and the duplicate state information among these
functions would be a waste of space.

> I've discussed item 5 off-list with Simon and he mentioned that
> we might invent a transition state and transition functions which
> can have parts switched on and off much like how calcSumX2
> controls if do_numeric_accum() should calculate sum(x*x) or not.
> The problem with this is that if we ever want to store aggregate
> states in an auto-updating materialized view, then this generic
> aggregate state will have to contain at least 3 fields (to store
> count(x), sum(x) and sum(x*x)), and those 3 fields would have to
> be stored even if the aggregate was just a simple count(*).

Yeah, I think we want to preserve the ability of count() to have a
simple state, and implement dependent aggregates as discussed in
the other thread -- where (as I understood it) having sum(x),
count(x), and avg(x) in a query would avoid the row-by-row work for
sum(x) and count(x), and just invoke a final function to extract
those values from the transition state of the avg(x) aggregate. I
see incremental maintenance of materialized views taking advantage
of the same sort of behavior, only maintaining the state for avg(x)
during incremental maintenance and *at the end* pulling the values
for sum(x) and count(x) out of that.

> The idea I discussed in the link in item 5 above gets around this
> problem, but it's a perhaps more surprise filled implementation
> as it will mean "select avg(x),sum(x),count(x) from t" is
> actually faster than "select sum(x),count(x) from t" as the agg
> state for avg() will satisfy sum and count too.

I'm skeptical that it will be noticeably faster. It's easy to see
why this optimization will make a query *with all three* faster,
but I would not expect the process of accumulating the sum and
count to be about the same speed whether performed by one
transition function or two. Of course I could be persuaded by a
benchmark showing otherwise.

> The purpose of this email is to open the doors for discussion
> about this so that nobody goes off and develops feature X into a
> corner and disallows feature Y and so that we end up with the
> most optimal solution that does not trip us up in the future.

That laudable. So far I don't see anything that will do anything
but make the materialized view maintenance easier, at least if
dependent aggregates are implemented as described in the other
thread.

> I'm interested to hear if Kevin or Amit have had any time to give
> this any thought before. It would be good to ensure we all have
> the same vision here.

It sounds great to me. I had thought about the need to deal with
these issues at some point to allow optimizations to the
incremental maintenance of materialized views (there needs to be a
fall-back of recalculating from scratch where such optimizations
are not possible, but people will not be happy with the performance
of that for simple cases where it is intuitively clear that we
could do better). You have developed the ideas farther than I
have, and (at least on a first review) I like what I'm seeing.
There may be some devils in the details down the road, but I would
say that what you're doing looks like it will dovetail nicely with
what's on my road map.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2015-06-09 15:00:10 Re: The Future of Aggregation
Previous Message Tom Lane 2015-06-09 14:26:40 Re: Aggregate Supporting Functions