Re: "pivot aggregation" with a patched intarray

From: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
To: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "pivot aggregation" with a patched intarray
Date: 2014-05-30 16:21:16
Message-ID: B6F6FD62F2624C4C9916AC0175D56D8828A80FC4@jenmbs01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(reposted, this time with attachment. sorry)

Hello,

(sorry for this long post)

I have patched intarray with 3 additional functions in order to count[distinct] event IDs
into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)

I need to use an array for the event IDs as I don't know how many of them exist, and the list may increase slowly upon the time.

The first results are quite promising.

Although this approach is only usable for a narrow set of use cases (*),
I wonder if I should look at other places in the source code to achieve a better implementation
and if there are discussions or plan to enhance Postgres with some support for such kind of storage (vectors of int).

(*) The array sizes should ideally not exceed a few tens
and the number of "events" should be unknown, otherwise using one column per event would be faster)

use case
========

c: category
s: session
e: event int4 (small range, mostly less than 20)

c s e
- - -
1 1 1
1 1 1
1 1 3
1 2 1
2 2 3
3 1 5

WITH SES AS (
SELECT s, c,
icount_to_array(e) ecount,
iarray_flag (e) edistinct
FROM foo
GROUP BY s, c)
SELECT c,
iarray_sum(ecount) ecount,
iarray_sum(edistinct)edistinct
FROM SES GROUP BY c


c ecount edistinct
- ------- ---------
1 [3,0,1] [2,0,1]
2 [0,0,1] [0,0,1]
3 [0,0,0,0,1] [0,0,0,0,1]

e.g.: the event 1 was met 3 times in the category 1, in 2 distinct sessions.

source code
===========
The 3 aggregates use following functions:

isumv: isumv([1,1,1],[0,0,1,3]) = [1,1,2,3])

iarray_addat : iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat : iarray_setat([0,0],2) = [0,1]
iarray_setat([0,1],2) = [0,1]

I've added them at the end of _int_op.c from intarray (attached)

missing in code: checks for integer overflow and that the array lower bound are all 1.

These are my first C lines ever and I've never learned it.
Hence I'm open for critics ;-)
I've started with this great posting by Craig Ringer:
http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
The rest is mostly copy and paste from other parts of intarray.

iarray_setat is just to set a "bitmap position" to 1, possibly enlarging it when required.
It is a trivial modification from iarray_addat
It should be possible to implement this more efficiently with bit[] or varbit, but this
lies beyond my C capbilities.

performances & advantage of icount_to_array
===========================================
As this aggregate allows to reduce the cardinality of GROUP BYs
it often performs better than a vertical approach.

With the horizontal storage, the result can be stored in smaller tables with much less rows,
hence bringing more advantages when it comes to indices.

e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1)
Total runtime: 333.420 ms

select user_id, count(*) from foo group by user_id order by 1

Sort (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1)
Sort Key: user_id
Sort Method: quicksort Memory: 28kB
-> HashAggregate (cost=41580.93..41581.41 rows=48 width=4) (actual time=492.606..492.615 rows=79 loops=1)
-> Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..128.800 rows=1860179 loops=1)
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap later on
---------------------
... which result in a significant difference in some cases.

create temp table ev AS SELECT * FROM (
generate_series(1,1000)s cross join
generate_series(1,500)c cross join
generate_series(1,15)e cross join
generate_series(1,5) repeat
)

explain analyze select s, c, icount_to_array(e) from ev group by s,c

HashAggregate (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1)
-> Seq Scan on ev (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1)
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000 loops=1)
-> Sort (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000 loops=1)
Sort Key: s, c, e
Sort Method: external merge Disk: 806424kB
-> Seq Scan on ev (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000 loops=1)
Total runtime: 89680.844 ms

Aggregates definition:
======================

CREATE AGGREGATE icount_to_array(integer) (
SFUNC=iarray_addat,
STYPE=int4[],
INITCOND='{0}'
);

CREATE AGGREGATE iarray_flag(integer) (
SFUNC=iarray_setat,
STYPE=int4[],
INITCOND='{0}'
);

CREATE AGGREGATE iarray_sum(int[]) (
SFUNC=isumv,
STYPE=int[],
INITCOND='{0}'
);

regards,

Marc Mamin

Attachment Content-Type Size
_int_op.c.gz application/x-gzip 2.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2014-05-30 16:22:55 Re: jsonb access operators inefficiency
Previous Message Teodor Sigaev 2014-05-30 16:19:07 Re: SP-GiST bug.