Re: HashAgg degenerate case

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: HashAgg degenerate case
Date: 2024-11-12 23:15:55
Message-ID: e061c5439996e13c0fb51997e92d6a09834a7796.camel@j-davis.com
Lists: pgsql-bugs

On Tue, 2024-11-12 at 10:48 -0500, Andres Freund wrote:
> That assumes the bucket array is a significant portion of the memory
> usage -
> but that's not a given, right?

TupleHashEntryData is 24 bytes, and the smallest firstTuple (just an
integer key and nothing else) is 20 bytes plus palloc overhead. So even
if the hash table is perfectly full, the entry array can still be ~50%
of total memory usage. The fraction can of course be smaller with
compound grouping keys and multiple aggregates, but it is still likely
to be a significant fraction.

For example:

create table big(i int, j int);
insert into big select g, 1 from generate_series(1,10000000) g;
vacuum freeze analyze big; checkpoint;
set hash_mem_multiplier = 1;
set work_mem = '430MB';
explain analyze select count(*) from
(select i from big group by i) s;

At peak, it uses ~630MB total (exceeding the memory limit by ~40%), of
which ~400MB is in the metacxt. The hash table has 16777216 buckets and
5896580 members.

Due to the cases that cause grow_threshold to be set to zero, the
bucket array is often about 3X the number of members just after
growing. And if the bucket array consumes most or all of the available
memory, that crowds out memory for the next batch, which may
recursively spill with a very sparse bucket array. That makes it all
the more important that we get the memory usage of the bucket array
under control.

A silver lining is that smaller hash tables often perform better, so
restricting batch size doesn't necessarily cause a regression and may
even improve things. I've known this since I originally wrote the disk-
based HashAgg, and perhaps we should explore this more.

While I'm looking, I also see a couple other areas of potential
improvement:

* can we reduce the size of TupleHashEntryData by having only one
pointer?
* it looks like we should be even more aggressive about partitioning
widely. Sometimes a larger work_mem creates a smaller number of
partitions, which can end up recursing more and producing more total
batches

> The former seems entirely non-viable on performance grounds.

Yeah, I think that was discussed during the original feature
development.

>   The latter might
> be ok, but I'm wary on code complexity grounds.

I attached a proof-of-concept patch, and it does add some complexity,
but it's not terrible. If tuplehash is going to be used for large hash
tables, I don't think we can hide the doubling nature from the caller
forever, assuming that the caller can do something about it (like
spill).

> What about simply checking whether the bucket array was a significant
> portion
> of the memory usage at spill time and recreating the hashtable
> instead of
> resetting IFF it was?

I think that would basically solve the degenerate case, but:

(a) if "significant" means ~40%, then it will trigger recreation for
a lot of cases, and it re-raises the questions about performance
regressions;
(b) it doesn't correct the problem where we grow the hashtable just
to hold one additional element, causing it to exceed the available
memory and not be able to actually use those new buckets; and
(c) it doesn't prevent overshooting the memory limit by a significant
amount.

A slightly better approach might be to recreate the hashtable if the
last batch had a hashtable that is less than 10% filled or something,
which will hopefully avoid problem (a).

>
> I doubt any change around this ought to be backpatched. The
> likelihood of regressions seems too high.

Perhaps there's some low-risk change we can make, like having a higher
minimum number of groups before spilling, or something like that?

I can move this discussion to -hackers until we sort it out.

Regards,
Jeff Davis

Attachment Content-Type Size
v1-0001-simplehash-add-grow_ok-to-API.patch text/x-patch 3.9 KB
v1-0002-HashAgg-avoid-unexpected-hash-table-growth-near-m.patch text/x-patch 11.8 KB
