Quick Links

Re: Large Scale Aggregation (HashAgg Enhancement)

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Rod Taylor <pg(at)rbt(dot)ca>
Cc:	PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Large Scale Aggregation (HashAgg Enhancement)
Date:	2006-01-16 08:32:15
Message-ID:	1137400335.3180.140.camel@localhost.localdomain
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion
> tuples down to 100 Million tuples based on an integer key with six
> integer values -- six sum()'s.
>
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> style Sort & Sum took a fair amount of time to complete (cancelled the
> process after 24 hours -- small machine).

> Spilling to disk would be nice but I suspect the obvious method would
> thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.

A question: Are the rows in your 3 B row table clumped together based
upon the 100M row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like
HashAgg
SortedAgg
or
SortedAgg
Sort
SortedAgg

The first SortedAgg seems superfluous, buy would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but its much easier to slot the extra executor step into the
plan). That might then reduce the size of the later sort, or allow it to
become a HashAgg.

I could make that manually enabled using "enable_pre_agg" to allow us to
measure the effectiveness of that technique and decide what cost model
we'd use to make it automatic. Would that help?

> I've written something similar using a client and COPY with temporary
> tables. Even with the Export/Import copy I still beat the Sort&Sum
> method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.

This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.

Best Regards, Simon Riggs

In response to

Large Scale Aggregation (HashAgg Enhancement) at 2006-01-16 05:07:19 from Rod Taylor

Responses

Re: Large Scale Aggregation (HashAgg Enhancement) at 2006-01-16 14:42:06 from Rod Taylor
Re: Large Scale Aggregation (HashAgg Enhancement) at 2006-01-16 17:36:19 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Simon Riggs	2006-01-16 08:35:09	Re: Coding standards? Recommendations?
Previous Message	Simon Riggs	2006-01-16 08:06:26	Re: ScanKey representation for RowCompare index