From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Memory-Bounded Hash Aggregation |
Date: | 2020-03-18 23:35:57 |
Message-ID: | 2688fb94f3d0e571768bf851b7fde0eae780e898.camel@j-davis.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Committed.
There's some future work that would be nice (some of these are just
ideas and may not be worth it):
* Refactor MemoryContextMemAllocated() to be a part of
MemoryContextStats(), but allow it to avoid walking through the blocks
and freelists.
* Improve the choice of the initial number of buckets in the hash
table. For this patch, I tried to preserve the existing behavior of
estimating the number of groups and trying to initialize with that many
buckets. But my performance tests seem to indicate this is not the best
approach. More work is needed to find what we should really do here.
* For workloads that are not in work_mem *or* system memory, and need
to actually go to storage, I see poor CPU utilization because it's not
effectively overlapping CPU and IO work. Perhaps buffering or readahead
changes can improve this, or async IO (even better).
* Project unnecessary attributes away before spilling tuples to disk.
* Improve logtape.c API so that the caller doesn't need to manage a
bunch of tape numbers.
* Improve estimate of the hash entry size. This patch doesn't change
the way the planner estimates it, but I observe that actual size as
seen at runtime is significantly different. This is connected to the
initial number of buckets for the hash table.
* In recursive steps, I don't have a good estimate for the number of
groups, so I just estimate it as the number of tuples in that spill
tape (which is pessimistic). That could be improved by doing a real
cardinality estimate as the tuples are spilling (perhaps with HLL?).
* Many aggregates with pass-by-ref transition states don't provide a
great aggtransspace. We should consider doing something smarter, like
having negative numbers represent a number that should be multiplied by
the size of the group (e.g. ARRAY_AGG would have a size dependent on
the group size, not a constant).
* If we want to handle ARRAY_AGG (and the like) well, we can consider
spilling the partial states in the hash table whem the memory is full.
That would add a fair amount of complexity because there would be two
types of spilled data (tuples and partial states), but it could be
useful in some cases.
Regards,
Jeff Davis
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2020-03-18 23:54:00 | Re: Memory-Bounded Hash Aggregation |
Previous Message | Alvaro Herrera | 2020-03-18 23:24:04 | Re: pgsql: Disk-based Hash Aggregation. |