Re: Default setting for enable_hashagg_disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-07-26 00:05:09
Message-ID: 20200726000509.vtl4p7vzazdr2dif@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Sat, Jul 25, 2020 at 10:07:37AM -0700, Peter Geoghegan wrote:
>On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> "Peak Memory Usage: 1605334kB"
>>
>> Hash agg avoids spilling entirely (so the planner gets it right this
>> time around). It even uses notably less memory.
>
>I guess that this is because the reported memory usage doesn't reflect
>the space used for transition state, which is presumably most of the
>total -- array_agg() is used in the query.
>

I'm not sure what you mean by "reported memory usage doesn't reflect the
space used for transition state"? Surely it does include that, we've
built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's
only a single group getting new values at any moment, so when we decide
to spill we'll only add rows to that group and everything else will be
spilled to disk.

In the unsorted case however we manage to initialize all groups in the
hash table, but at that point the groups are tiny an fit into work_mem.
As we process more and more data the groups grow, but we can't evict
them - at the moment we don't have that capability. So we end up
processing everything in memory, but significantly exceeding work_mem.

FWIW all my tests are done on the same TPC-H data set clustered by
l_shipdate (so probably random with respect to other columns).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message Peter Geoghegan 2020-07-26 00:13:00 Re: Default setting for enable_hashagg_disk
Previous Message Jeff Davis 2020-07-25 23:56:51 Re: Default setting for enable_hashagg_disk

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-07-26 00:13:00 Re: Default setting for enable_hashagg_disk
Previous Message Jeff Davis 2020-07-25 23:56:51 Re: Default setting for enable_hashagg_disk