Re: Memory-Bounded Hash Aggregation

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-05 02:55:43
Message-ID: 2c8dc5c9178c391493dfbf6c2f595dc298dd477e.camel@j-davis.com
Lists: pgsql-hackers


Thanks very much for a great review! I've attached a new patch.

There are also some significant changes in the new version:

In the non-spilling path, I removed the extra nullcheck branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.
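
As a standalone illustration of the shape of that change (all names
here are invented; the real patch recompiles the evaltrans expression
with a new opcode rather than swapping a C function pointer):

    #include <stddef.h>
    #include <stdio.h>

    typedef struct AggSketch AggSketch;
    struct AggSketch
    {
        long   *pergroup;                  /* NULL once the group spills */
        void  (*trans)(AggSketch *, long); /* "compiled" transition step */
    };

    /* branch-free variant: valid while no tuple has spilled */
    static void
    trans_fast(AggSketch *state, long value)
    {
        *state->pergroup += value;
    }

    /* variant with the nullcheck branch, used after the first spill */
    static void
    trans_spill(AggSketch *state, long value)
    {
        if (state->pergroup == NULL)
            return;             /* group lives on disk; skip transition */
        *state->pergroup += value;
    }

    int
    main(void)
    {
        long      group = 0;
        AggSketch state = {&group, trans_fast};

        state.trans(&state, 5);    /* fast path: no extra branch */

        state.trans = trans_spill; /* first spill: install the branch */
        state.pergroup = NULL;     /* this group's state moved to disk */
        state.trans(&state, 7);    /* safely skipped */

        printf("group = %ld\n", group); /* prints 5 */
        return 0;
    }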

I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.

I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres)
the performance is looking pretty good. It's pretty easy to beat
Sort+Group when the group size is 10+. Even for an average group size
of ~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.

My answers to your questions inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the estimate
> of number of groups? I mean, if we know we've consumed 1000 tuples which
> created 100 groups, then we know there's ~1:10 ratio.

That would be a good estimate for an even distribution, but not
necessarily for a skewed distribution. I'm not opposed to it, but it's
generally my philosophy to overpartition as it seems there's not a big
downside.
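
To make the suggestion concrete, here's a minimal sketch of the
ratio-based estimate (invented names, not patch code), with the skew
caveat spelled out:

    /* Sketch only; assumes tuples_seen > 0 and that the observed
     * tuple:group ratio continues to hold. */
    static double
    estimate_total_groups(double tuples_seen, double groups_seen,
                          double tuples_total)
    {
        /* e.g. 1000 tuples -> 100 groups gives a ~1:10 ratio */
        double  groups_per_tuple = groups_seen / tuples_seen;

        /*
         * Under a skewed distribution this can overestimate badly: if
         * the common groups arrive early, later tuples mostly land in
         * existing groups rather than creating new ones.
         */
        return groups_per_tuple * tuples_total;
    }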

> A couple of comments based on eye-balling the patch:
>
>
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).

>
> assume the pointer really is NULL. Surely we'll get a segfault on the
> preceding line which does dereference it
>
> pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
>
> Or am I missing anything?

That's not actually dereferencing anything; it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, and it's less readable, so I changed the order to put the NULL
check first.
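
A standalone example of the distinction, with the reordering applied
(invented names, not the actual executor code):

    #include <stddef.h>

    static long
    init_trans_sketch(long *pergroup_allaggs, int transno)
    {
        /* NULL check first, before any arithmetic on the base pointer */
        if (pergroup_allaggs == NULL)
            return -1;

        /*
         * This computes an address only; nothing is loaded from
         * memory, which is why the original ordering didn't crash. It
         * is still undefined behavior on a NULL base, though, so
         * checking first is both safer and more readable.
         */
        long   *pergroup = &pergroup_allaggs[transno];

        return *pergroup;       /* the actual dereference */
    }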

>
> 3) execGrouping.c
>
> A couple of functions would deserve a comment, explaining what they do.
>
> - LookupTupleHashEntryHash
> - prepare_hash_slot
> - calculate_hash

Done, thank you.

> And it's not clear to me why we should remove part of the comment
> before TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.

>
> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that was
> generally the case we'd just use small hash tables all the time. It's a
> bit annoying to give the user the capability to set work_mem and then
> kinda override that.

I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.
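
Roughly the shape of the headroom idea (a sketch; the 0.75 factor here
is a made-up example, not the patch's actual value):

    #include <stddef.h>

    #define HASH_PARTITION_FACTOR   0.75    /* illustrative value only */

    static size_t
    effective_hash_limit(size_t hash_mem_limit)
    {
        /*
         * Stop filling the hash table somewhat before the memory limit
         * so the spill partitions have room to operate and we're less
         * likely to spill recursively.
         */
        return (size_t) (hash_mem_limit * HASH_PARTITION_FACTOR);
    }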

> 5) Not sure what "directly" means in this context?
>
> * partitions at the time we need to spill, and because this algorithm
> * shouldn't depend too directly on the internal memory needs of a
> * BufFile.
>
> #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
>
> Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

> 6) I think we should have some protection against underflows in this
> piece of code:
>
> - this would probably deserve some protection against underflow if
> HASH_PARTITION_MEM gets too big
>
> if (hashagg_mem_overflow)
> aggstate->hash_mem_limit = SIZE_MAX;
> else
> aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
>
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
> bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.
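
The guard amounts to something like this (a sketch, not the actual
patch hunk):

    #include <stddef.h>
    #include <stdint.h>

    #define HASH_MIN_PARTITIONS     4
    #define BLCKSZ                  8192
    #define HASH_PARTITION_MEM      (HASH_MIN_PARTITIONS * BLCKSZ)

    static size_t
    choose_hash_mem_limit(long work_mem_kb, int mem_overflow)
    {
        size_t  total = (size_t) work_mem_kb * 1024;

        if (mem_overflow)
            return SIZE_MAX;    /* old behavior: no spilling */

        /* guard the subtraction if HASH_PARTITION_MEM grows past total */
        if (total <= HASH_PARTITION_MEM)
            return 0;

        return total - HASH_PARTITION_MEM;
    }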

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

>
> 8) The comment before lookup_hash_entries says:
>
> ...
> * Return false if hash table has exceeded its memory limit.
> ...
>
> But that's clearly bogus, because that's a void function.

Thank you, improved comment.

> 9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
> have a comment, similar to the surrounding code? Might be overkill, not
> sure.

Sure, done.

> 10) The comment for agg_refill_hash_table says
>
> * Should only be called after all in memory hash table entries have been
> * consumed.
>
> Can we enforce that with an assert, somehow?

It's a bit awkward. Simplehash doesn't expose the number of groups, and
we would also have to check each hash table. Not a bad idea to add an
interface to simplehash to make that work, though.
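
If simplehash did expose a member count, the assertion could look
roughly like this (entirely hypothetical names):

    #include <assert.h>

    typedef struct HashTableSketch
    {
        int     nentries;       /* hypothetical exposed member count */
    } HashTableSketch;

    static void
    assert_tables_consumed(HashTableSketch *tables, int ntables)
    {
        /* every in-memory hash table must be fully consumed on entry */
        for (int i = 0; i < ntables; i++)
            assert(tables[i].nentries == 0);
    }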

> 11) The hash_spill_npartitions naming seems a bit confusing, because it
> seems to imply it's about the "spill" while in practice it just chooses
> the number of spill partitions. Maybe hash_choose_num_spill_partitions
> would be better?

Done.

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
> if (npartitions > HASH_MAX_PARTITIONS)
> npartitions = HASH_MAX_PARTITIONS;

Added a comment. There's no deep reasoning there -- I just don't want
it to choose to create 5000 files and surprise a user.
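
The clamp itself is simple (a sketch using the constants discussed
above):

    #define HASH_MIN_PARTITIONS     4
    #define HASH_MAX_PARTITIONS     256

    static int
    clamp_partitions(int npartitions)
    {
        /* cap the fanout so we don't surprise users with huge numbers
         * of temporary files */
        if (npartitions < HASH_MIN_PARTITIONS)
            npartitions = HASH_MIN_PARTITIONS;
        if (npartitions > HASH_MAX_PARTITIONS)
            npartitions = HASH_MAX_PARTITIONS;
        return npartitions;
    }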

> 13) As for this:
>
> /* make sure that we don't exhaust the hash bits */
> if (partition_bits + input_bits >= 32)
> partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching; we should be careful to use the same approach in both
> places (not the same code, just general approach).

Didn't investigate this yet, but will do.
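
For context, the clamp quoted above is just about the 32-bit hash
budget; a sketch with a worked example in the comment:

    static int
    clamp_partition_bits(int partition_bits, int input_bits)
    {
        /*
         * The 32-bit hash value is shared between the bits already
         * consumed routing tuples into this spill file (input_bits)
         * and the bits used to choose a partition now. For example,
         * input_bits = 28 leaves 4 bits, i.e. at most 16 partitions
         * at this level.
         */
        if (partition_bits + input_bits >= 32)
            partition_bits = 32 - input_bits;
        return partition_bits;
    }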

Regards,
Jeff Davis

Attachment Content-Type Size
hashagg-20191204.diff text/x-patch 88.2 KB
