From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Rod Taylor <pg(at)rbt(dot)ca> |
Cc: | PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Large Scale Aggregation (HashAgg Enhancement) |
Date: | 2006-01-16 08:32:15 |
Message-ID: | 1137400335.3180.140.camel@localhost.localdomain |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion
> tuples down to 100 Million tuples based on an integer key with six
> integer values -- six sum()'s.
>
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> style Sort & Sum took a fair amount of time to complete (cancelled the
> process after 24 hours -- small machine).
> Spilling to disk would be nice but I suspect the obvious method would
> thrash quite badly with non-sorted input.
There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.
A question: Are the rows in your 3 B row table clumped together based
upon the 100M row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like
HashAgg
SortedAgg
or
SortedAgg
Sort
SortedAgg
The first SortedAgg seems superfluous, buy would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but its much easier to slot the extra executor step into the
plan). That might then reduce the size of the later sort, or allow it to
become a HashAgg.
I could make that manually enabled using "enable_pre_agg" to allow us to
measure the effectiveness of that technique and decide what cost model
we'd use to make it automatic. Would that help?
> I've written something similar using a client and COPY with temporary
> tables. Even with the Export/Import copy I still beat the Sort&Sum
> method PostgreSQL falls back to.
You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.
This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.
Best Regards, Simon Riggs
From | Date | Subject | |
---|---|---|---|
Next Message | Simon Riggs | 2006-01-16 08:35:09 | Re: Coding standards? Recommendations? |
Previous Message | Simon Riggs | 2006-01-16 08:06:26 | Re: ScanKey representation for RowCompare index |