Quick Links

Re: Spilling hashed SetOps and aggregates to disk

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Spilling hashed SetOps and aggregates to disk
Date:	2018-06-11 18:51:09
Message-ID:	5bc4336f-c243-183b-4dc2-9bbf2ff22ee2@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 06/11/2018 08:13 PM, Jeff Davis wrote:
> On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
>> For example we hit the work_mem limit after processing 10% tuples,
>> switching to sort would mean spill+sort of 900GB of data. Or we
>> might say - hmm, we're 10% through, so we expect hitting the limit
>> 10x, so let's spill the hash table and then do sort on that,
>> writing and sorting only 10GB of data. (Or merging it in some
>> hash-based way, per Robert's earlier message.)
>
> Your example depends on large groups and a high degree of group
> clustering. That's fine, but it's a special case,
>

True, it's a special case and it won't work for other cases. It was
merely an example for Andres.

OTOH it's not entirely unrealistic, I think. Consider something like

SELECT
extract(year from ts) AS y,
extract(month from ts) AS m,
extract(day from ts) AS d,
string_agg(x),
array_agg(y)
FROM fact_table
GROUP BY y, m, d;

which is likely very correlated (assuming new data is appended to the
table), and the string_agg/array_agg are likely to produce fairly large
groups (about proportional to the number of tuples in the group).

Another example might be about HLL aggregate, although in that case the
transition state does not grow, so it may not be that bad (and the
default estimate of 1kB would work pretty nicely). But there certainly
are other aggregates with large transition state, where this might not
be the case, and we currently have no way to communicate that to the
planner - except for setting work_mem much lower :-/

However, I now realize I've ignored the fact that we typically don't
sort the whole table but only a very few columns, so the example was not
entirely fair - we would not sort the whole remaining 900GB but likely
much less.

> and complexity does have a cost, too.

Sure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: Spilling hashed SetOps and aggregates to disk at 2018-06-11 18:13:45 from Jeff Davis

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Julien Rouhaud	2018-06-11 18:51:21	Re: Proposal: Partitioning Advisor for PostgreSQL
Previous Message	Alvaro Herrera	2018-06-11 18:37:32	Re: PostgreSQL vs SQL Standard