From: | Rod Taylor <pg(at)rbt(dot)ca> |
---|---|
To: | PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Large Scale Aggregation (HashAgg Enhancement) |
Date: | 2006-01-16 05:07:19 |
Message-ID: | 1137388039.15377.33.camel@home |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
A couple of days ago I found myself wanting to aggregate 3 Billion
tuples down to 100 Million tuples based on an integer key with six
integer values -- six sum()'s.
PostgreSQL ran out of memory with its Hash Aggregator and doing an old
style Sort & Sum took a fair amount of time to complete (cancelled the
process after 24 hours -- small machine).
Spilling to disk would be nice but I suspect the obvious method would
thrash quite badly with non-sorted input.
One solution is to partially sort the data into various buckets. If we
know how many keys can fit into sort_mem and what the upper and lower
bounds of our keys are then # Keys per MB / sort_mem temporary files can
be created. A sequential scan of the source data would sort each tuple
into the appropriate temporary file. From there we can loop through a
temporary file, HashAgg the contents, present the results, and move to
the next temporary file.
For my particular problem the lower bound is 1 and the upper bound is
about 100M. The sort_mem setting allows HashAgg to handle 1M keys at a
time. The first pass through the 3B tuples would create 100 temporary
files on disk. Temp file 1 would get 1 through 1M, temp file 2 gets keys
1M + 1 through 2M, etc. From there it is pretty easy.
This would allow for a 1000 fold increase in the number of distinct keys
PostgreSQL can simultaneously HashAgg in the default configuration at a
reasonable speed.
I've written something similar using a client and COPY with temporary
tables. Even with the Export/Import copy I still beat the Sort&Sum
method PostgreSQL falls back to.
--
From | Date | Subject | |
---|---|---|---|
Next Message | Simon Riggs | 2006-01-16 08:06:26 | Re: ScanKey representation for RowCompare index |
Previous Message | Tom Lane | 2006-01-16 01:00:05 | Re: pgxs/windows |