From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-08-02 15:21:23
Message-ID: 20190802152123.ichmacgx3y3x7dc3@development
Lists: pgsql-hackers
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:
>On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
>> I'm late to the party.
>
>You are welcome to join any time!
>
>> These two approaches both spill the input tuples - what if the skewed
>> groups are not encountered before the hash table fills up? The spill
>> files' size and disk I/O could be downsides.
>
>Let's say the worst case is that we encounter 10 million groups of size
>one first - just enough to fill up memory. Then, we encounter a single
>additional group of size 20 million, and need to write out all of those
>20 million raw tuples. That's still not worse than Sort+GroupAgg, which
>would need to write out all 30 million raw tuples (in practice Sort is
>pretty fast, so it may still win in some cases, but not by any huge
>amount).
>
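To make the accounting explicit, here is a trivial sketch of the spill
volumes in that scenario (the numbers are purely illustrative):

#include <stdio.h>

int
main(void)
{
    long    small_groups = 10000000L;   /* size-1 groups, just fill memory */
    long    big_group = 20000000L;      /* one huge group arriving late */

    /* hashagg only spills the tuples arriving after memory is full */
    long    hashagg_spill = big_group;

    /* sort+groupagg has to write out every input tuple */
    long    sort_spill = small_groups + big_group;

    printf("hashagg spills %ld tuples, sort spills %ld tuples\n",
           hashagg_spill, sort_spill);
    return 0;
}
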
>> Greenplum spills all the groups by writing out the partial aggregate
>> states, resets the memory context, processes incoming tuples to build
>> a new in-memory hash table, then at the end reloads and combines the
>> spilled partial states - how does this sound?
>
>That can be done as an add-on to approach #1 by evicting the entire
>hash table (writing out the partial states), then resetting the memory
>context.
>
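For what it's worth, here is a toy, self-contained sketch of that
pattern (evict the whole table, reset, combine later) using a plain
sum() aggregate - none of this is executor code, and all of the names
are made up:

#include <stdio.h>

#define MAX_GROUPS 4            /* pretend work_mem only fits 4 groups */

typedef struct
{
    int     key;
    long    sum;                /* the "partial aggregate state" */
} Group;

static Group groups[MAX_GROUPS];
static int ngroups = 0;

/* write all partial states to the spill file and empty the table */
static void
evict_all(FILE *spill)
{
    fwrite(groups, sizeof(Group), ngroups, spill);
    ngroups = 0;                /* stands in for MemoryContextReset() */
}

/* advance one input tuple; evict everything once the table is full */
static void
advance(FILE *spill, int key, long value)
{
    int     i;

    for (i = 0; i < ngroups; i++)
    {
        if (groups[i].key == key)
        {
            groups[i].sum += value;     /* the transition function */
            return;
        }
    }

    if (ngroups == MAX_GROUPS)
        evict_all(spill);

    groups[ngroups].key = key;
    groups[ngroups].sum = value;
    ngroups++;
}

int
main(void)
{
    FILE   *spill = tmpfile();
    long    result[8] = {0};
    Group   g;
    int     i;

    /* skewed input: keys 0..7, with key 3 showing up a lot */
    for (i = 0; i < 1000; i++)
        advance(spill, (i % 5 == 0) ? (i % 8) : 3, 1);

    /* final pass: flush what's left, then combine the spilled states */
    evict_all(spill);
    rewind(spill);
    while (fread(&g, sizeof(Group), 1, spill) == 1)
        result[g.key] += g.sum;         /* the combine function */

    for (i = 0; i < 8; i++)
        printf("key %d: sum %ld\n", i, result[i]);

    fclose(spill);
    return 0;
}

The fwrite() in evict_all() stands in for serializing each group's
transition value with its serialfn, and the fread() loop is the combine
step - which is exactly why this only works for aggregates that provide
a serial/combine pair.
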
>It does add to the complexity though, and would only work for the
>aggregates that support serializing and combining partial states. It
>also might be a net loss to do the extra work of initializing and
>evicting a partial state if we don't have large enough groups to
>benefit.
>
>Given that the worst case isn't worse than Sort+GroupAgg, I think it
>should be left as a future optimization. That would give us time to
>tune the process to work well in a variety of cases.
>
+1 to leaving that as a future optimization
I think it's clear there's no perfect eviction strategy - for every
algorithm we come up with, we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).
So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future by switching to a different eviction
strategy based on some heuristics.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services