Re: How to speedup intarray aggregate function?

From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: Filip Rembiałkowski <plk(dot)zuber(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to speedup intarray aggregate function?
Date: 2007-10-10 09:45:33
Message-ID: d7df81620710100245p7521ec5bv66ed8bcde41e3265@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Wow, seems I've found that!

* Beginning in PostgreSQL 8.1, the executor's AggState node is passed
as
* the fmgr "context" value in all transfunc and finalfunc calls. It
is
* not really intended that the transition functions will look into the
* AggState node, but they can use code like
* if (fcinfo->context && IsA(fcinfo->context, AggState))
* to verify that they are being called by nodeAgg.c and not as
ordinary
* SQL functions. The main reason a transition function might want to
know
* that is that it can avoid palloc'ing a fixed-size pass-by-ref
transition
* value on every call: it can instead just scribble on and return its
left
* input. Ordinarily it is completely forbidden for functions to
modify
* pass-by-ref inputs, but in the aggregate case we know the left input
is
* either the initial transition value or a previous function result,
and
* in either case its value need not be preserved. See int8inc() for
an
* example. Notice that advance_transition_function() is coded to
avoid a
* data copy step when the previous transition value pointer is
returned.

So theoretically I may create intarray_aggregate_push() function which, when
called by aggregate, does not reallocate & copy memory each time it is
called. Instead, it may allocate 1M memory at once (with gap), or enlarge
the memory segment by factor of 2 when it need to reallocate it (it is
O(log2) instead of O(N)).

And here is an example from the source code:

Datum
int8inc(PG_FUNCTION_ARGS)
{
if (fcinfo->context && IsA(fcinfo->context, AggState))
{
/*
* Special case to avoid palloc overhead for COUNT(): when called
from
* nodeAgg, we know that the argument is modifiable local storage,
so
* just update it in-place.
*
* Note: this assumes int8 is a pass-by-ref type; if we ever support
* pass-by-val int8, this should be ifdef'd out when int8 is
* pass-by-val.
*/
int64 *arg = (int64 *) PG_GETARG_POINTER(0);
int64 result;

result = *arg + 1;
/* Overflow check */
if (result < 0 && *arg > 0)
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("bigint out of range")));

*arg = result;
PG_RETURN_POINTER(arg);
}
...
}

On 10/10/07, Dmitry Koterov <dmitry(at)koterov(dot)ru> wrote:
>
> Thanks for your comment.
>
> I see two possible solution directions:
>
>
> 1. Is it possible to create C-function, which could accept something like
> ROWSET(ARRAY[]) in its input parameters?
> E.g. to call it as
>
> SELECT array_rowset_glue((SELECT arrayfield FROM arraytable));
>
> or something like this?
>
>
> 2. Is it possible to implement in C something like this?
>
> array_buffer_init();
> SELECT array_buffer_push(arrayfield) FROM arraytable;
> ids := array_buffer_get();
> array_buffer_free();
>
> where array_buffer_push() is an aggregate function which returns void,
> but, as its side-effect, appends arrayfield to the global array buffer for
> later acces with array_buffer_get().
>
>
> On 10/10/07, Filip Rembiałkowski <plk(dot)zuber(at)gmail(dot)com> wrote:
> >
> > 2007/10/10, Dmitry Koterov <dmitry(at)koterov(dot)ru>:
> > > Hello.
> > >
> > > I created an aggregate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_push (_int4)
> > > (
> > > STYPE = _int4,
> > > SFUNC = intarray_push_array,
> > > INITCOND = '{}'
> > > );
> > >
> > > (or - I may use _int_union instead of intarray_push_array, its speed
> > is
> > > practically the same in my case).
> > > This aggregate merges together a list of integer[] arrays resulting
> > one big
> > > array with all elements.
> > >
> > > Then I want to use this aggregate:
> > >
> > > SELECT intarray_aggregate_push(arrayfield)
> > > FROM arraytable
> > >
> > > The table arraytable contains a lot of rows (about 5000), each row
> > has
> > > array with length of 5-10 elements, so - the resulting array should
> > contain
> > > about 50000 elements.
> > >
> > > The query is okay, but its speed is too bad: about 1 second.
> > >
> > > The main problem is the speed of intarray_aggregate_push function - it
> > is
> > > quite slow, because intarray_push_array reallocates the memory each
> > time I
> > > merge two arrays. I am pretty sure that the reallocaton and copying is
> > the
> > > bottleneck, because if I use another dummy aggreate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_dummy (_int4)
> > > (
> > > STYPE = _int4,
> > > SFUNC = dummy,
> > > INITCOND = '{}'
> > > );
> > >
> > > CREATE OR REPLACE FUNCTION "public"."dummy" (a integer [], b integer
> > [])
> > > RETURNS integer [] AS
> > > $body$ BEGIN RETURN a; END; $body$
> > > LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;
> > >
> > > where dummy() is the function which returns its first argument without
> > any
> > > modification, the speed grows dramatically - about 25 ms (instead of
> > 1000
> > > ms!).
> > >
> > > The question is: how could I optimize this, and is it possible at all
> > in
> > > Postgres? I just want to get one large array glued from a lot of
> > smaller
> > > arrays...
> >
> >
> > 1. no wonder copying is the bottleneck - this is what the aggregate
> > does, mostly.
> >
> > 2. you can use plain array_cat for this, in my test it is few percent
> > faster
> >
> > 3. in this case I guess intarrray contrib is not an option, AFAIK it
> > was created only for speeding up searches, that is int4[] lookups
> >
> > 4. to have this kind of optimization you talk about, we would need an
> > aggregate operating (in this case appending) directly on
> > internalstate. i'm not sure if this is possible in postgres
> >
> > 5. my results:
> > your method (using intarray_push_array): 940 ms
> > using array_cat: 860 ms
> > same in PL/PgSQL: (LOOP, append) 800 ms
> > same thing in Perl, no database (push array of arrays into one and
> > print ): 18 ms
> >
> >
> > cheers, Filip
> >
> >
> > --
> > Filip Rembiałkowski
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Have you searched our list archives?
> >
> > http://archives.postgresql.org/
> >
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Stefan Schwarzer 2007-10-10 09:55:31 Re: ORDER BY - problem with NULL values
Previous Message vladimir konrad 2007-10-10 09:41:05 Re: corrupt database?