Quick Links

proposal - GROUPING SETS

From:	"Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To:	"PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	proposal - GROUPING SETS
Date:	2008-09-16 10:27:20
Message-ID:	162867790809160327n23dd2f0eyd0f6ff6365afa25@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello,

== Proposal - GROUPING SETS ==

a grouping set feature allows multiple grouping clauses in one
query. Result of grouping sets is union of results each groupby
clause's result.

create table t(a int, b int);
insert into t values(10,20);
insert into t values(30,40);

select a, b from t group by grouping sets(a, b);

is same as:

select a, NULL from t group by a
union all
select NULL, b from t group by b;

Note: all ungrouped vars are transformed to NULL

Groupby clause should contains cube and rollup lists.
These are transformed to grouping sets via transformed rules:

create table t1(a int, b int, c int)

group by rollup(a, b, c)
->
group by grouping sets((a,b,c), (a,b), (a), ())

group by cube(a,b,c)
->
group by grouping sets ((a,b,c),(a,b),(a,c), (a), (b,c),
(b), (c), ())

Groupby clause or grouping sets should contains more sets.
Result is multiplication of these sets:

group by grouping sets(a), grouping sets(b, (b,c), ())
->
group by grouping sets((a,b), (a,b,c), (a))

When grouping sets are used, then we should to use grouping
and grouping_id functions. Function grouping returns 1 when
parameter is in current group set, else returns 0. Function
grouping_id returns value as

grouping_id(a,b,c) =
to_dec(to_bin(grouping(a) || grouping(b) || grouping(c)))

postgres=# select * from t;
a | b | c
----+----+----
10 | 20 | 30
10 | 20 | 30
(2 rows)

postgres=# select a,b,c from t group by grouping sets(a,b,c);
a | b | c
----+----+----
10 | |
| 20 |
| | 30
(3 rows)

postgres=# select a,b,c, grouping(a), grouping(b), grouping(c),
grouping_id(a,b,c) from t group by grouping sets(a,b,c);
a | b | c | grouping | grouping | grouping | grouping_id
----+----+----+----------+----------+----------+-------------
10 | | | 1 | 0 | 0 | 4
| 20 | | 0 | 1 | 0 | 2
| | 30 | 0 | 0 | 1 | 1
(3 rows)

some real sample:

create table report(
inserted date,
locality varchar,
name varchar,
c int);

postgres=# copy report to stdout;
2008-10-10 Prague Milk 10
2008-10-11 Prague Milk 12
2008-10-10 Prague Rum 2
2008-10-11 Prague Rum 6
2008-10-10 Berlin Milk 8
2008-10-11 Berlin Milk 14
2008-10-10 Berlin Beer 20
2008-10-11 Berlin Beer 25

postgres=# select inserted, locality, name, sum(c)
from report
group by grouping sets(inserted, locality, name);
inserted | locality | name | sum
------------+----------+------+-----
2008-10-10 | | | 40
2008-10-11 | | | 57
| Berlin | | 67
| Prague | | 30
| | Milk | 44
| | Rum | 8
| | Beer | 45
(7 rows)

postgres=# select inserted, locality, name, sum(c)
from report
group by name, grouping sets(inserted, locality);
inserted | locality | name | sum
------------+----------+------+-----
2008-10-11 | | Rum | 6
2008-10-10 | | Rum | 2
2008-10-10 | | Beer | 20
2008-10-11 | | Beer | 25
2008-10-10 | | Milk | 18
2008-10-11 | | Milk | 26
| Prague | Rum | 8
| Berlin | Beer | 45
| Berlin | Milk | 22
| Prague | Milk | 22
(10 rows)

== Implementation ==
Grouping sets introduce a new concept into SQL. One readed
tuple is multiple used. It's similar with WITH clause. It's little
bit dificult implement it for current PostgreSQL's executor.

In my prototype I used aux node Feeder. This node should to hold
only one tuple. I add new method for Agg node, that process only
one input tuple:

for (;;)
{
tuple = feeder->execute(grouping_sets->lefttree);
foreach(l, grouping_sets->subplans)
{
lfirst(l)->process(tuple);
}
if (is_null(tuple))
break;
}

It supports only HASH Agg nodes. For non hash agg should be used
other method (currently it isn't supported).

postgres=# explain verbose select inserted, locality, name, sum(c)
from report
group by name,grouping sets(inserted, locality);
QUERY PLAN
-----------------------------------------------------------------
Grouping Sets (cost=12.00..35.00 rows=400 width=68)
Output: NULL::date, locality, name, sum(c)
-> Seq Scan on report (cost=0.00..18.00 rows=800 width=68)
Output: inserted, locality, name, c
-> HashAggregate (cost=6.00..8.50 rows=200 width=68)
Output: inserted, NULL::character varying, name, sum(c)
-> Feeder (cost=0.00..0.00 rows=800 width=68)
Output: inserted, locality, name, c
-> HashAggregate (cost=6.00..8.50 rows=200 width=68)
Output: NULL::date, locality, name, sum(c)
-> Feeder (cost=0.00..0.00 rows=800 width=68)
Output: inserted, locality, name, c
(12 rows)

After work on prototype I don't see any problems in executor or
parser. I expect some dificulties in planner. With grouping sets
grouping_planner procedure will be much more complex:

1. targetlist and groupclause will be list of list,
2. we should to repeat estimation of NumGroups for each group
set (it's should be shared with CTE feature??).

== Parser problems ==
1. The identifier cube is used in contrib cube.
Solution: CUBE '(' ... ')' generates funcCall and it is transformed
to grouping sets only in groupby clause later.

I invite any ideas, notes and help with documentation.

Regards
Pavel Stehule

Responses

Re: proposal - GROUPING SETS at 2008-09-16 14:02:46 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Simon Riggs	2008-09-16 11:53:43	Subtransaction commits and Hot Standby
Previous Message	Marko Kreen	2008-09-16 10:26:17	Re: [Review] fix dblink security hole