From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 9.5: Memory-bounded HashAgg
Date: 2014-08-10 21:26:50
Message-ID: 1407706010.6623.16.camel@jeff-desktop
Lists: pgsql-hackers
This patch requires the Memory Accounting patch, or something similar,
to track memory usage.
The attached patch enables hashagg to spill to disk, which means that
hashagg will stay within work_mem even if the planner badly
misestimates the cardinality.
This is a well-known concept; there's even a Berkeley homework
assignment floating around to implement it -- in postgres 7.2, no
less. I didn't take the exact same approach as the homework assignment
suggests, but it's not much different, either. My apologies if some
classes are still using this as a homework assignment, but postgres
needs to eventually have an answer to this problem.
Included is a GUC, "enable_hashagg_disk" (default on), which allows
the planner to choose hashagg even if it doesn't expect the hashtable
to fit in memory. If it's off, and the planner misestimates the
cardinality, hashagg will still use the disk to stay within work_mem.
One situation that might surprise the user is if work_mem is set too
low, and the user is *relying* on a misestimate to pick hashagg. With
this patch, it would end up going to disk, which might be
significantly slower. The solution for the user is to increase
work_mem.
Rough Design:
Change the hash aggregate algorithm to accept a generic "work item",
which consists of an input file as well as some other bookkeeping
information.
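For concreteness, a work item could be something like the sketch below. This
is not the patch's actual code; BufFile is the backend's existing buffered
temp-file abstraction, and the struct and field names are just illustrative:

    #include "storage/buffile.h"

    /*
     * Hypothetical layout of a "work item".  input_file == NULL means the
     * item reads from the outer plan; otherwise it re-reads tuples that
     * were spilled earlier.
     */
    typedef struct HashAggWorkItem
    {
        BufFile    *input_file;     /* spilled input, or NULL for outer plan */
        int         npartitions;    /* number of output partitions to spill into */
        BufFile   **output_files;   /* lazily-created spill file per partition */
    } HashAggWorkItem;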
Initially prime the algorithm by adding a single work item where the
file is NULL, indicating that it should read from the outer plan.
If memory is exhausted during execution of a work item, then
continue to allow existing groups to be aggregated, but do not allow new
groups to be created in the hash table. Tuples representing new groups
are saved in an output partition file referenced in the work item that
is currently being executed.
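In rough pseudocode, the per-tuple path looks something like this. Again,
only a sketch: the helpers and fields shown here (mem_exhausted,
lookup_group(), spill_tuple(), hash_table_memory(), and so on) are invented
stand-ins for whatever the real patch uses, with the memory check relying on
the Memory Accounting patch:

    /*
     * Sketch of the per-tuple path.  All helper and field names are
     * illustrative, not the actual patch's functions.
     */
    static void
    hashagg_process_tuple(AggState *aggstate, HashAggWorkItem *item,
                          TupleTableSlot *slot)
    {
        GroupEntry *entry;

        if (!aggstate->mem_exhausted)
        {
            /* normal path: find or create the group, advance transitions */
            entry = lookup_or_create_group(aggstate, slot);
            advance_aggregates(aggstate, entry);

            /* memory accounting tells us how much the table is using */
            if (hash_table_memory(aggstate) > work_mem * 1024L)
                aggstate->mem_exhausted = true;
        }
        else
        {
            /* memory exhausted: only tuples matching existing groups are kept */
            entry = lookup_group(aggstate, slot);   /* no insert */
            if (entry != NULL)
                advance_aggregates(aggstate, entry);
            else
            {
                /* new group: spill the tuple to one of this item's partitions */
                uint32      hashvalue = compute_group_hash(aggstate, slot);
                int         partno = hashvalue % item->npartitions;

                if (item->output_files[partno] == NULL)
                    item->output_files[partno] = BufFileCreateTemp(false);
                spill_tuple(item->output_files[partno], slot, hashvalue);
            }
        }
    }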
When the work item is done, emit any groups in the hash table, clear the
hash table, and turn each output partition file into a new work item.
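Putting the pieces together, the control flow amounts to a queue of work
items, roughly like the sketch below. In the real executor, groups are of
course emitted incrementally from ExecAgg rather than in one synchronous
loop, and the helper names are again invented:

    /*
     * Sketch of the overall control flow.  make_work_item(NULL) reads from
     * the outer plan; every non-empty output partition becomes a new item.
     */
    static void
    hashagg_run(AggState *aggstate)
    {
        List       *work_queue = list_make1(make_work_item(NULL));

        while (work_queue != NIL)
        {
            HashAggWorkItem *item = (HashAggWorkItem *) linitial(work_queue);
            int         i;

            work_queue = list_delete_first(work_queue);

            /* aggregate this item's input, spilling new groups if memory fills */
            hashagg_consume_input(aggstate, item);

            /* whatever fit in the hash table is finalized and emitted now */
            hashagg_emit_groups(aggstate);
            hashagg_reset_table(aggstate);

            for (i = 0; i < item->npartitions; i++)
            {
                if (item->output_files[i] != NULL)
                    work_queue = lappend(work_queue,
                                         make_work_item(item->output_files[i]));
            }
        }
    }

Because each pass keeps at least the groups that fit in work_mem, the queue
eventually drains, which is the termination argument in the next paragraph.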
Each time through, at least some groups are able to stay in the hash
table, so eventually none will need to be saved in output partitions, no
new work items will be created, and the algorithm will terminate. This
is true even if the number of output partitions is always one.
Open items:
* costing
* EXPLAIN details for disk usage
* choose number of partitions intelligently
* performance testing
Initial tests indicate that it can be competitive with sort+groupagg
when the disk is involved, but more testing is required.
Feedback welcome.
Regards,
Jeff Davis
Attachment: hashagg-disk-20140810.patch (text/x-patch, 39.4 KB)