From: | Greg Stark <gsstark(at)mit(dot)edu> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Rod Taylor <pg(at)rbt(dot)ca>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Large Scale Aggregation (HashAgg Enhancement) |
Date: | 2006-01-17 05:01:38 |
Message-ID: | 87y81fwiq5.fsf@stark.xeocode.com |
Lists: | pgsql-hackers |
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
>
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time. For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).
For a hash aggregate, would it be possible to rescan the original table instead
of spilling to temporary files? Then, when you run out of working memory, you
simply throw out half the hash table and ignore subsequent tuples that fall in
those hash buckets, and afterwards rescan for the discarded hash bucket regions.
This avoids having to do any disk writes at the possible expense of additional
reads. I think in terms of i/o it would be much faster in most cases.
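Roughly the loop I'm imagining, as a toy standalone sketch rather than anything
resembling the real executor code (the names, the tiny in-memory limit, and the
count(*)-style aggregate are all invented just to illustrate the control flow):

/*
 * Toy sketch of "rescan instead of spill": count(*) per key over an
 * in-memory array.  When the group table fills up, discard the upper half
 * of the hash range this pass is responsible for and queue that range for
 * another pass over the same input, instead of writing anything to disk.
 */
#include <assert.h>
#include <stdio.h>

#define MAX_GROUPS   4          /* stand-in for work_mem */
#define MAX_PENDING 64          /* hash ranges still waiting for a pass */

typedef struct { int key; long count; } Group;
typedef struct { unsigned lo, hi; } Range;

static unsigned hash_key(int key) { return (unsigned) key * 2654435761u; }

int main(void)
{
    /* "rescanning the table" is just re-reading this array */
    int input[] = {7, 3, 7, 9, 3, 12, 5, 7, 9, 1, 12, 5, 8, 2, 7, 3, 8};
    int ninput = (int) (sizeof input / sizeof input[0]);

    Range pending[MAX_PENDING];
    int npending = 0;
    pending[npending++] = (Range){0, 0xffffffffu};

    while (npending > 0)
    {
        Range r = pending[--npending];      /* hash range this pass handles */
        Group groups[MAX_GROUPS];
        int ngroups = 0;

        for (int i = 0; i < ninput; i++)    /* one scan of the input */
        {
            unsigned h = hash_key(input[i]);
            int j;

            if (h < r.lo || h > r.hi)
                continue;                   /* some other pass's problem */

            for (j = 0; j < ngroups; j++)   /* linear search stands in for a
                                             * real hash table */
                if (groups[j].key == input[i])
                    break;
            if (j < ngroups)
            {
                groups[j].count++;          /* existing group: just advance it */
                continue;
            }

            /*
             * Need a new group but "memory" may be full: keep halving the
             * range we handle, queueing the discarded upper half for a later
             * rescan and evicting any groups we already built in it.
             */
            while (ngroups == MAX_GROUPS && r.hi > r.lo)
            {
                unsigned mid = r.lo + (r.hi - r.lo) / 2;

                assert(npending < MAX_PENDING);
                pending[npending++] = (Range){mid + 1, r.hi};
                r.hi = mid;
                for (int k = 0; k < ngroups; k++)
                    if (hash_key(groups[k].key) > mid)
                        groups[k--] = groups[--ngroups];    /* evict */
                if (h > mid)
                    break;                  /* this tuple is now deferred too */
            }
            if (h > r.hi || ngroups == MAX_GROUPS)
                continue;                   /* (more than MAX_GROUPS keys
                                             * colliding on one 32-bit hash
                                             * value isn't handled here) */

            groups[ngroups].key = input[i];
            groups[ngroups].count = 1;
            ngroups++;
        }

        for (int j = 0; j < ngroups; j++)   /* emit groups finished this pass */
            printf("key %d: count %ld\n", groups[j].key, groups[j].count);
    }
    return 0;
}

The wasteful part is the eviction step: any partially-built aggregate states in
the discarded half of the range are simply thrown away and rebuilt from scratch
on a later pass, which is what makes case c) below a concern.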
The downsides are: a) volatile aggregates or aggregates with side effects would
be confused by being executed twice. I'm not clear that volatile aggregate
functions make any sense anyways, though. b) I'm unclear whether rescanning the
table could potentially find tuples in a different state than previous scans
did. If so, the idea doesn't work at all, but I don't think that's possible, is
it?
The main problem is c): it may lose in terms of i/o for cases where the
cardinality is low (i.e., it's overflowing despite having low cardinality
because the table is really, really big too). But most cases will be spilling
because the cardinality is high, so the table may be big but the spill files
are nearly as big anyways, and having to write and then read them means double
the i/o.
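To put rough numbers on it (sizes invented purely for illustration), say the
table is 10GB and, because the cardinality is high, the spill files come to
about 8GB:

  spill approach:   read table (10GB) + write spill (8GB) + read spill (8GB) = ~26GB of i/o
  rescan approach:  read table (10GB) + one extra full scan (10GB)           = ~20GB of i/o,
                    all of it reads and no writes (assuming one extra pass
                    is enough)

Flip the cardinality, though, and it goes the other way: if the spill files
were only 1GB, spilling costs ~12GB of i/o while a second full scan still
costs ~20GB, which is case c) above.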
The upside of not having to write out temporary files is big. I find queries
that require temporary sort files get hit with a *huge* performance penalty,
often an order of magnitude. Part of that could probably be mitigated by
putting the sort files on a separate spindle, but I think it's always going to
hurt, especially if there are multiple operations spilling to disk
simultaneously.
--
greg