Quick Links

Re: Memory-Bounded Hash Aggregation

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Taylor Vesely <tvesely(at)pivotal(dot)io>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Memory-Bounded Hash Aggregation
Date:	2019-08-30 06:28:12
Message-ID:	038966b0ba9d7e0445d15d09908ac82202c55a30.camel@j-davis.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> I started to review this patch yesterday with Melanie Plageman, so we
> rebased this patch over the current master. The main conflicts were
> due to a simplehash patch that has been committed separately[1]. I've
> attached the rebased patch.

Great, thanks!

> I was playing with the code, and if one of the table's most common
> values isn't placed into the initial hash table it spills a whole lot
> of tuples to disk that might have been avoided if we had some way to
> 'seed' the hash table with MCVs from the statistics. Seems to me that
> you would need some way of dealing with values that are in the MCV
> list, but ultimately don't show up in the scan. I imagine that this
> kind of optimization would most useful for aggregates on a full table
> scan.

Interesting idea, I didn't think of that.

> Some questions:
>
> Right now the patch always initializes 32 spill partitions. Have you
> given
> any thought into how to intelligently pick an optimal number of
> partitions yet?

Yes. The idea is to guess how many groups are remaining, then guess how
much space they will need in memory, then divide by work_mem. I just
didn't get around to it yet. (Same with the costing work.)

> By add-on approach, do you mean to say that you have something in
> mind
> to combine the two strategies? Or do you mean that it could be
> implemented
> as a separate strategy?

It would be an extension of the existing patch, but would add a fair
amount of complexity (dealing with partial states, etc.) and the
benefit would be fairly modest. We can do it later if justified.

> That said, I think it's worth mentioning that with parallel
> aggregates
> it might actually be more useful to spill the trans values instead,
> and have them combined in a Gather or Finalize stage.

That's a good point.

Regards,
Jeff Davis

In response to

Re: Memory-Bounded Hash Aggregation at 2019-08-28 19:52:13 from Taylor Vesely

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fabien COELHO	2019-08-30 06:44:28	Re: pg_upgrade: Error out on too many command-line arguments
Previous Message	Etsuro Fujita	2019-08-30 06:13:37	Re: A problem about partitionwise join