From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-05 02:55:43
Message-ID: 2c8dc5c9178c391493dfbf6c2f595dc298dd477e.camel@j-davis.com
Lists: pgsql-hackers
Thanks very much for a great review! I've attached a new patch.
There are also some significant changes in this new version:
In the non-spilling path, I removed the extra null-check branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.
I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.
I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.
With all of the changes I made (thanks to some suggestions from
Andres), the performance is looking pretty good. It's pretty easy to
beat Sort+Group when the group size is 10+. Even for an average group
size of ~1, HashAgg is getting really close to Sort in some cases.
There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.
My answers to your questions inline:
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the
> estimate of the number of groups? I mean, if we know we've consumed
> 1000 tuples which created 100 groups, then we know there's ~1:10
> ratio.
That would be a good estimate for an even distribution, but not
necessarily for a skewed one. I'm not opposed to it, but my general
philosophy is to overpartition, since there doesn't seem to be a big
downside.
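For illustration, the running estimate you're describing might look
something like this (the function and names are hypothetical, not from
the patch, and as noted it can underestimate under skew):

    /*
     * Sketch of a consumed-tuples-based estimate: if 1000 consumed
     * tuples produced 100 groups, assume a ~1:10 ratio holds for the
     * remaining input. Under a skewed distribution the in-memory
     * groups absorb disproportionately many tuples, so this tends to
     * underestimate the number of groups left in the spill files.
     */
    static double
    estimate_remaining_groups(double tuples_consumed, double groups_seen,
                              double tuples_remaining)
    {
        if (tuples_consumed <= 0 || groups_seen <= 0)
            return tuples_remaining;    /* no data; assume all distinct */
        return tuples_remaining * (groups_seen / tuples_consumed);
    }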
> A couple of comments based on eye-balling the patch:
>
>
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?
The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).
> 2) [...] assume the pointer really is NULL. Surely we'll get a
> segfault on the preceding line which does dereference it
>
>     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
>
> Or am I missing anything?
That's not actually dereferencing anything; it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, or at least not as readable, so I changed the order to put the
NULL check first.
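A standalone example of the distinction (illustrative only):

    #include <stdio.h>

    int
    main(void)
    {
        int *arr = NULL;

        /*
         * &arr[3] is pointer arithmetic, not a dereference: it reduces
         * to "arr + 3" and reads no memory. Strictly speaking it is
         * still undefined behavior on a NULL base in ISO C, which is
         * why putting the NULL check first is the cleaner fix.
         */
        int *p = &arr[3];

        printf("%p\n", (void *) p);
        return 0;
    }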
>
> 3) execGrouping.c
>
> A couple of functions would deserve a comment, explaining what it
> does.
>
> - LookupTupleHashEntryHash
> - prepare_hash_slot
> - calculate_hash
Done, thank you.
> And it's not clear to me why we should remove part of the comment
> before TupleHashTableHash.
Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.
>
> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that
> was generally the case we'd just use small hash tables all the time.
> It's a bit annoying to give user the capability to set work_mem and
> then kinda override that.
I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.
> 5) Not sure what "directly" means in this context?
>
>  * partitions at the time we need to spill, and because this algorithm
>  * shouldn't depend too directly on the internal memory needs of a
>  * BufFile.
>
>  #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
>
> Does that mean we don't want to link to PGAlignedBlock, or what?
That's what I meant, yes, but I reworded the comment to not say that.
> 6) I think we should have some protection against underflows in this
> piece of code:
>
> - this would probably deserve some protection against underflow if
>   HASH_PARTITION_MEM gets too big
>
>     if (hashagg_mem_overflow)
>         aggstate->hash_mem_limit = SIZE_MAX;
>     else
>         aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
>
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
> to bump HASH_MIN_PARTITIONS up, this can underflow.
Thank you, done.
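The guard I have in mind is essentially this (a sketch; the exact form
in the patch may differ):

    /*
     * Convert work_mem (kB) to bytes before subtracting the partition
     * overhead, and clamp so the limit can't wrap around to a huge
     * unsigned value if HASH_PARTITION_MEM ever exceeds work_mem.
     */
    if (hashagg_mem_overflow)
        aggstate->hash_mem_limit = SIZE_MAX;
    else if ((Size) work_mem * 1024 > HASH_PARTITION_MEM)
        aggstate->hash_mem_limit = (Size) work_mem * 1024 - HASH_PARTITION_MEM;
    else
        aggstate->hash_mem_limit = (Size) work_mem * 1024; /* no headroom */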
> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?
Improved.
>
> 8) The comment before lookup_hash_entries says:
>
> ...
> * Return false if hash table has exceeded its memory limit.
> ..
>
> But that's clearly bogus, because that's a void function.
Thank you, improved comment.
> 9) Shouldn't the hash_finish_initial_spills calls in
> agg_retrieve_direct have a comment, similar to the surrounding code?
> Might be an overkill, not sure.
Sure, done.
> 10) The comment for agg_refill_hash_table says
>
> * Should only be called after all in memory hash table entries have
> * been consumed.
>
> Can we enforce that with an assert, somehow?
It's a bit awkward. Simplehash doesn't expose the number of groups, and
we would also have to check each hash table. Not a bad idea to add an
interface to simplehash to make that work, though.
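If such an accessor existed (tuplehash_count below is hypothetical, as
is the exact traversal), the check could look like:

    /*
     * Hypothetical: tuplehash_count() would be a new simplehash
     * accessor returning the live entry count; check every grouping
     * set's hash table before refilling from a spill file.
     */
    #ifdef USE_ASSERT_CHECKING
    for (int setno = 0; setno < aggstate->num_hashes; setno++)
        Assert(tuplehash_count(aggstate->perhash[setno].hashtable) == 0);
    #endif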
> 11) The hash_spill_npartitions naming seems a bit confusing, because
> it seems to imply it's about the "spill" while in practice it just
> chooses the number of spill partitions. Maybe
> hash_choose_num_spill_partitions would be better?
Done.
> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too
> many partitions? Comment?
>
>     if (npartitions > HASH_MAX_PARTITIONS)
>         npartitions = HASH_MAX_PARTITIONS;
Added a comment. There's no deep reasoning there -- I just don't want
it to choose to create 5000 files and surprise a user.
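Shape-wise it's just a two-sided clamp, something like this sketch
(the sizing heuristic here is made up; only the clamp mirrors the
quoted code):

    /*
     * Scale the number of spill partitions with the projected spill
     * volume, but clamp to [HASH_MIN_PARTITIONS, HASH_MAX_PARTITIONS]:
     * too few partitions means deep recursive spilling, and too many
     * means surprising the user with thousands of open BufFiles, each
     * costing roughly BLCKSZ of buffer memory.
     */
    static int
    hash_choose_num_spill_partitions(uint64 input_groups, double mem_per_group)
    {
        int     npartitions;

        npartitions = (int) (input_groups * mem_per_group /
                             (work_mem * 1024.0));

        if (npartitions < HASH_MIN_PARTITIONS)
            npartitions = HASH_MIN_PARTITIONS;
        if (npartitions > HASH_MAX_PARTITIONS)
            npartitions = HASH_MAX_PARTITIONS;

        return npartitions;
    }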
> 13) As for this:
>
>     /* make sure that we don't exhaust the hash bits */
>     if (partition_bits + input_bits >= 32)
>         partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching, we should be careful to use the same approach in
> both places (not the same code, just general approach).
Didn't investigate this yet, but will do.
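My understanding of the general approach is to carve disjoint bit
ranges out of the 32-bit hash, roughly like this (an illustrative
sketch, not code from either implementation):

    /*
     * "input_bits" were already consumed by outer recursion levels, so
     * take the partition selector from the next partition_bits bits,
     * clamping so the two ranges never overlap.
     */
    static uint32
    hash_spill_partition(uint32 hash, int input_bits, int partition_bits)
    {
        if (partition_bits + input_bits >= 32)
            partition_bits = 32 - input_bits;

        return (hash >> input_bits) & (((uint32) 1 << partition_bits) - 1);
    }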
Regards,
Jeff Davis
Attachment: hashagg-20191204.diff (text/x-patch, 88.2 KB)