Re: Combining Aggregates

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2016-01-18 02:26:30
Message-ID: CAKJS1f-DrQztzZHxHc74fgKO3cAKoEf4QocGv+YUWr2xr6=b7w@mail.gmail.com
Lists: pgsql-hackers

On 18 January 2016 at 14:36, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
wrote:

> On Sat, Jan 16, 2016 at 12:00 PM, David Rowley
> <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> > On 16 January 2016 at 03:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>
> >> On Tue, Dec 29, 2015 at 7:39 PM, David Rowley
> >> <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> >> >> No, the idea I had in mind was to allow it to continue to exist in
> >> >> the expanded format until you really need it in the varlena format,
> >> >> and then serialize it at that point. You'd actually need to do the
> >> >> opposite: if you get an input that is not in expanded format, expand
> >> >> it.
> >> >
> >> > Admittedly I'm struggling to see how this can be done. I've spent a
> >> > good bit of time analysing how the expanded object stuff works.
> >> >
> >> > Hypothetically let's say we can make it work like:
> >> >
> >> > 1. During partial aggregation (finalizeAggs = false), in
> >> > finalize_aggregates(), where we'd normally call the final function,
> >> > instead flatten INTERNAL states and store the flattened Datum instead
> >> > of the pointer to the INTERNAL state.
> >> > 2. During combining aggregation (combineStates = true) have all the
> >> > combine functions written in such a way that the INTERNAL states
> >> > expand the flattened states before combining the aggregate states.
> >> >
> >> > Does that sound like what you had in mind?
> >>
> >> More or less. But what I was really imagining is that we'd get rid of
> >> the internal states and replace them with new datatypes built to
> >> purpose. So, for example, for string_agg(text, text) you could make a
> >> new datatype that is basically a StringInfo. In expanded form, it
> >> really is a StringInfo. When you flatten it, you just get the string.
> >> When somebody expands it again, they again have a StringInfo. So the
> >> RW pointer to the expanded form supports append cheaply.
> >
> >
> > That sounds fine in theory, but where and how do you suppose we determine
> > which expand function to call? Nothing exists in the catalogs to decide
> > this currently.
>
> I am thinking of a transition function that accepts and returns a
> StringInfoData instead of the PolyNumAggState internal data, for
> int8_avg_accum for example.
>

Hmm, so wouldn't that mean that the transition function would need to (for
each input tuple):

1. Parse that StringInfo into tokens.
2. Create a new aggregate state object.
3. Populate the new aggregate state based on the tokenised StringInfo; this
would perhaps require that various *_in() functions be called on each
token.
4. Add the new tuple to the aggregate state.
5. Build a new StringInfo based on the aggregate state modified in 4.

?

Currently the transition function only does step 4, and performs step 2
only for the first tuple.

Is that what you mean? If so, I'd say that would slow things down
significantly!

To get a gauge on how much more CPU work that would be for some aggregates,
have a look at how simple int8_avg_accum() is currently when we have
HAVE_INT128 defined. For the case of AVG(BIGINT) we just really have:

state->sumX += newval;
state->N++;

The above code is step 4 only. So unless I've misunderstood you, that would
need to turn into steps 1-5 above. Step 4 here is probably just a handful
of instructions right now, but adding code for steps 1, 2, 3 and 5 would
turn that into hundreds.
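
To make the comparison concrete, here is a minimal sketch in plain C (not
PostgreSQL code). AvgState, accum_internal() and accum_flattened() are
hypothetical names, the state is flattened as the text "sumX N" purely for
illustration, and a 64-bit sum stands in for the 128-bit one used when
HAVE_INT128 is defined.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the internal AVG(BIGINT) state. */
typedef struct AvgState
{
    int64_t sumX;   /* running sum (assumption: 64-bit for portability) */
    int64_t N;      /* number of values accumulated */
} AvgState;

/* Step 4 only: what the transition function does per tuple today. */
static void
accum_internal(AvgState *state, int64_t newval)
{
    state->sumX += newval;
    state->N++;
}

/*
 * Steps 1-5: the state travels as text in buf ("sumX N"), so every call
 * has to parse it, rebuild the state, accumulate, and format it again.
 * Returns 0 on success, -1 on a parse error or if buf is too small.
 */
static int
accum_flattened(char *buf, size_t buflen, int64_t newval)
{
    AvgState state;     /* step 2: a fresh state object per call */
    int      written;

    /* steps 1 and 3: tokenise the text and convert each token */
    if (sscanf(buf, "%" SCNd64 " %" SCNd64, &state.sumX, &state.N) != 2)
        return -1;

    /* step 4: the only work the internal version needs */
    accum_internal(&state, newval);

    /* step 5: build the flattened form again */
    written = snprintf(buf, buflen, "%" PRId64 " %" PRId64,
                       state.sumX, state.N);
    return (written < 0 || (size_t) written >= buflen) ? -1 : 0;
}
```

Both functions give the same sum and count; the difference is that
accum_flattened() pays for a parse and a format on every input tuple.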

I've been trying to avoid any overhead by adding the serializeStates flag
to make_agg() so that we can maintain the same performance when we're just
passing internal states around in the same process. This keeps the
conversions between internal state and serialised state to a minimum.
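
Roughly, the idea behind the flag looks like the sketch below. This is plain
C for illustration only, not the patch: PartialResult, emit_partial() and
combine_partial() are hypothetical names, and a raw byte copy stands in for
whatever serialisation a real aggregate would use.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an internal aggregate state. */
typedef struct AvgState
{
    int64_t sumX;
    int64_t N;
} AvgState;

/* Output of a partial aggregate: either a pointer or flattened bytes. */
typedef struct PartialResult
{
    bool          is_serialized;
    AvgState     *state;                    /* valid when !is_serialized */
    unsigned char bytes[sizeof(AvgState)];  /* valid when is_serialized */
} PartialResult;

/*
 * Flatten the state only when asked to (e.g. it must leave the process);
 * otherwise just hand over the pointer, which costs nothing.
 */
static PartialResult
emit_partial(AvgState *state, bool serialize_states)
{
    PartialResult r;

    memset(&r, 0, sizeof(r));
    r.is_serialized = serialize_states;
    if (serialize_states)
        memcpy(r.bytes, state, sizeof(AvgState));
    else
        r.state = state;
    return r;
}

/* Combine one partial result into dst, expanding it first if needed. */
static void
combine_partial(AvgState *dst, const PartialResult *src)
{
    AvgState        tmp;
    const AvgState *in = src->state;

    if (src->is_serialized)
    {
        memcpy(&tmp, src->bytes, sizeof(tmp));
        in = &tmp;
    }
    dst->sumX += in->sumX;
    dst->N += in->N;
}
```

With the flag off, the combine step reads the state through the pointer and
no conversion happens at all; the conversion cost is paid only on the path
that needs a flattened form.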

> The StringInfoData is formed from the members of the PolyNumAggState
> structure. The input StringInfoData is transformed into PolyNumAggState
> data, the calculation is finished, and a StringInfoData is formed again
> and returned. Similar changes need to be done for the final functions'
> input type also. I am not sure whether this approach may have some
> impact on performance?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
