From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Hash support for grouping sets
Date: 2017-01-06 03:48:58
Message-ID: 87vatszyhj.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers
Herewith a patch for doing grouping sets via hashing or mixed hashing
and sorting.
The principal objective is to hash whichever combination of grouping sets
has an estimated total size that fits in work_mem, while minimizing the
number of sorting passes we need to do over the data. (Yes, this is a
knapsack problem.)
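For illustration, here is a toy sketch of the selection problem (in Python, not the patch's C code; all set names and size estimates below are made up). Grouping sets that are prefixes of one another can share a single sort pass, so the goal is to pick a subset to hash whose combined estimated hash-table size fits in work_mem while minimizing the sort passes needed for the rest. The brute-force enumeration here stands in for the knapsack-style search the patch actually performs.

```python
from itertools import combinations

def sorts_needed(sets):
    """Sort passes for the un-hashed grouping sets: a set that is a
    prefix of a longer one rides along with that sort for free."""
    passes = 0
    covered = []
    for s in sorted(sets, key=len, reverse=True):
        if any(t[:len(s)] == s for t in covered):
            continue  # shares the sort of an already-counted set
        covered.append(s)
        passes += 1
    return passes

def choose_hashed(sets, est_size, work_mem):
    """Brute force over all subsets (a stand-in for the real knapsack
    search): hash what fits in work_mem, minimize sorts for the rest."""
    best = None
    for r in range(len(sets) + 1):
        for hashed in combinations(sets, r):
            if sum(est_size[s] for s in hashed) > work_mem:
                continue
            cost = sorts_needed([s for s in sets if s not in hashed])
            key = (cost, -len(hashed))  # on ties, prefer hashing more
            if best is None or key < best[0]:
                best = (key, hashed)
    return best[1], best[0][0]
```

With the five singleton sets from the work_mem='64kB' example further down and toy size estimates (two=2, ten=10, hundred=100, unique1=10000, thousand=1000) under a budget of 200 "units", this hashes two, ten, and hundred and leaves two sorts (unique1 and thousand), matching the shape of that plan.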
This is achieved by adding another strategy to the Agg node: AGG_MIXED
means that both hashing and (sorted or plain) grouping happen. In
addition, support for multiple hash tables in AGG_HASHED mode is added.
Sample plans look like this:
explain select two,ten from onek group by cube(two,ten);
                          QUERY PLAN
--------------------------------------------------------------
 MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
   Hash Key: two, ten
   Hash Key: two
   Hash Key: ten
   Group Key: ()
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)
(6 rows)
explain select two,ten from onek group by two, cube(ten,twenty);
                           QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=86.50..100.62 rows=162 width=12)
   Hash Key: two, ten, twenty
   Hash Key: two, ten
   Hash Key: two
   Hash Key: two, twenty
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)
(6 rows)
set work_mem='64kB';
explain select count(*) from tenk1
group by grouping sets (unique1,thousand,hundred,ten,two);
                               QUERY PLAN
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..3023.89 rows=11112 width=28)
   Hash Key: two
   Hash Key: ten
   Hash Key: hundred
   Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: unique1
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
(10 rows)
Columns with unhashable or unsortable data types are handled
appropriately, though obviously every individual grouping set must end
up containing only columns of compatible (all-hashable or all-sortable)
types.
There is one current weakness which I don't see a good solution for: the
planner code still has to pick a single value for group_pathkeys before
planning the input path. This means that we sometimes can't choose a
minimal set of sorts, because we may have chosen a sort order for a
grouping set that should have been hashed, for example:
explain select count(*) from tenk1
group by grouping sets (two,unique1,thousand,hundred,ten);
                               QUERY PLAN
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..4126.28 rows=11112 width=28)
   Hash Key: ten
   Hash Key: hundred
   Group Key: two
   Sort Key: unique1
     Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: two
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
(11 rows)
(3 sorts, vs. 2 in the previous example, which listed the same grouping
sets in a different order.)
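To make the cost difference concrete, here is a toy Python helper (hypothetical, not planner code) that counts sorts once the first ordering has been fixed: the input sort handles the forced ordering, and each remaining sorted-only grouping set that is not a prefix of an existing ordering costs one more sort.

```python
def total_sorts(forced_first, other_sorted_sets):
    """One sort for the forced input ordering, plus one for each
    sorted-only grouping set that can't reuse an earlier ordering."""
    orderings = [forced_first]
    sorts = 1
    for s in other_sorted_sets:
        if any(o[:len(s)] == s for o in orderings):
            continue  # prefix of an existing sort order: free
        orderings.append(s)
        sorts += 1
    return sorts

# Sorting the input first on "two" (a set better served by hashing)
# leaves unique1 and thousand needing their own sorts: 3 in total.
# Sorting first on "unique1" instead needs only 2, as in the earlier plan.
```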
Current status of the work: it probably still needs cleanup and maybe
some better regression tests, and Andres has expressed some concern over
the performance impact of the additional complexity in
advance_aggregates; it would be useful if this could be tested. But it
should be reviewable as-is.
--
Andrew (irc:RhodiumToad)
Attachment: gshash10.patch (text/x-patch, 140.7 KB)