Re: HashAgg degenerate case

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: HashAgg degenerate case
Date: 2024-11-12 15:48:31
Message-ID: 3btw7dgj56k5samfzyrxnqdoiocbiycp44osza3iahzn3jqdro@ecxairlbdzqx
Lists: pgsql-bugs

Hi,

On 2024-11-08 12:40:39 -0800, Jeff Davis wrote:
> On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> > I can think of two approaches to solve it:
>
> Another thought: the point of spilling is to avoid using additional
> memory by adding a group.
>
> If the bucket array is 89.8% full, adding a new group doesn't require
> new buckets to be allocated, so we only have the firstTuple and
> pergroup data to worry about. If that causes the memory limit to be
> exceeded, it's the perfect time to switch to spill mode.

> But if the bucket array is 89.9% full, then adding a new group will
> cause the bucket array to double. If that causes the memory limit to be
> exceeded, then we can switch to spill mode, but it's wasteful to do so
> because (a) we won't be using most of those new buckets; (b) the new
> buckets will crowd out space for subsequent batches and even fewer of
> the buckets will be used; and (c) the memory accounting can be off by
> quite a bit.

> What if we have a check where, if the metacxt is using more than 40% of
> the memory, and if adding a new group would reach the grow_threshold,
> then enter spill mode immediately?

That assumes the bucket array is a significant portion of the memory usage -
but that's not a given, right? We also might need to grow to keep the number
of conflict chains at a manageable level.
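
Just to make sure we're talking about the same thing, my reading of the
proposed check is roughly the following -- sketch only, the names below are
made up for illustration and aren't the actual nodeAgg.c / simplehash fields:

/*
 * Sketch of the proposed pre-insert check, not actual nodeAgg.c code.
 * meta_mem stands in for the memory tracked in the hash table's metadata
 * context (the bucket array), total_mem for the aggregate's overall usage.
 */
static bool
should_spill_instead_of_grow(Size meta_mem, Size total_mem,
                             uint32 members, uint32 grow_threshold)
{
    /* inserting one more group would trigger a doubling of the bucket array */
    bool        about_to_grow = (members + 1) >= grow_threshold;

    /* only special-case this if the bucket array is a big share of memory */
    bool        buckets_dominate = meta_mem > total_mem * 0.4;

    return about_to_grow && buckets_dominate;
}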

> To make this work, I think we either need to use a tuplehash_lookup()
> followed by a tuplehash_insert() (two lookups for each new group), or we
> would need a new API into simplehash like tuplehash_insert_without_growing()
> that would return NULL instead of growing.

The former seems entirely non-viable on performance grounds. The latter might
be ok, but I'm wary on code complexity grounds.
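
If we did go that route, the caller side would presumably end up looking
something like the below -- again just a sketch, tuplehash_insert_without_growing()
and its signature are hypothetical (it doesn't exist in simplehash today), and
hashtable/key/aggstate are assumed from the surrounding code:

/*
 * Hypothetical caller in the hash aggregation path.  The proposed
 * tuplehash_insert_without_growing() would behave like tuplehash_insert()
 * but return NULL instead of growing the bucket array.
 */
bool        found;
TupleHashEntry entry;

entry = tuplehash_insert_without_growing(hashtable->hashtab, key, &found);
if (entry == NULL)
{
    /* the insert would have forced the bucket array to grow: spill instead */
    hash_agg_enter_spill_mode(aggstate);
}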

What about simply checking, at spill time, whether the bucket array was a
significant portion of the memory usage, and recreating the hashtable instead
of resetting it IFF it was?
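
Roughly along these lines -- sketch only, reusing the existing nodeAgg.c /
execGrouping.c names (hash_metacxt, build_hash_table(), ResetTupleHashTable())
from memory, while the call site, the 50% threshold and the handling of
grouping sets are hand-waved:

/*
 * Sketch: at spill time, decide between resetting the existing hash table
 * (cheap, but keeps the possibly-doubled bucket array around) and building
 * a fresh one sized for the next batch.
 */
static void
hash_agg_reset_or_rebuild(AggState *aggstate, int setno, long next_nbuckets)
{
    AggStatePerHash perhash = &aggstate->perhash[setno];
    Size        meta_mem;
    Size        hashkey_mem;

    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true);
    hashkey_mem = MemoryContextMemAllocated(
        aggstate->hashcontext->ecxt_per_tuple_memory, true);

    if (meta_mem > (meta_mem + hashkey_mem) / 2)
    {
        /* bucket array dominates: throw it away and build a right-sized one */
        MemoryContextReset(aggstate->hash_metacxt);
        build_hash_table(aggstate, setno, next_nbuckets);
    }
    else
    {
        /* bucket array is a modest share: just empty the table */
        ResetTupleHashTable(perhash->hashtable);
    }
}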

> This approach might not be backportable, but it might be a good approach for
> 18+.

I doubt any change around this ought to be backpatched. The likelihood of
regressions seems too high.

Greetings,

Andres Freund
