Re: Default setting for enable_hashagg_disk

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-06-26 20:53:05
Message-ID: CAH2-WzmK_hvWfUCabm5Fe+e-YF3oV0q8XpNA5vfyDO_0Hv9K2A@mail.gmail.com
Lists: pgsql-docs pgsql-hackers

On Thu, Jun 25, 2020 at 1:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> 12 28164.865 ms
>
> fast ssd:
> HEAD 92520.680 ms
>
> magnetic:
> HEAD 183968.538 ms
>
> (no reads, there's plenty enough memory. Just writes because the age /
> amount thresholds for dirty data are reached)
>
> In the magnetic case we're IO bottlenecked nearly the whole time.

I agree with almost everything you've said on this thread, but at the
same time I question the emphasis on I/O here. You've shown that
spinning rust is about twice as slow as a fast SSD. Fair enough,
but to me the real story is that spilling is clearly a lot slower in
general, regardless of how fast the storage subsystem happens to be (I
wonder how fast it is with a ramdisk). To me, it makes more sense to
think of the problem here as the fact that v13 will *not* do
aggregation using the fast strategy (i.e. in-memory) -- as opposed to
the problem being that v13 does the aggregation using the slow
strategy (which is assumed to be slow because it involves I/O instead
of memory buffers).
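
To make the fast-vs-slow distinction concrete, here is roughly how I'd
flip between the two strategies on v13 -- a sketch only, since the
table and aggregate below are stand-ins rather than your actual test
case; only work_mem is varied:

  -- Hypothetical grouping query; any aggregate with a wide transition
  -- state shows the same pattern.
  SET enable_hashagg = on;

  SET work_mem = '4GB';   -- hash table fits: the fast, in-memory path
  EXPLAIN (ANALYZE, BUFFERS) SELECT a, array_agg(b) FROM t GROUP BY a;

  SET work_mem = '64MB';  -- too small: v13 spills hash agg batches to disk
  EXPLAIN (ANALYZE, BUFFERS) SELECT a, array_agg(b) FROM t GROUP BY a;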

I get approximately the same query runtimes with your "make the
transition state a bit bigger" test case. With "set enable_hashagg =
off", I get a group aggregate + sort. It spills to disk, even with
work_mem = '15GB' -- leaving 4 runs to merge at the end. That takes
63702.992 ms on v13. But if I reduce work_mem radically, to only 16MB
(a 960x decrease!), the run time only increases by ~30% -- it's only
83123.926 ms. So we're talking about a ~200% increase (for hash
aggregate) versus a ~30% increase (for group aggregate + sort) on fast
SSDs.
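
Spelled out, that comparison amounts to something like this (again a
sketch -- 't' and array_agg stand in for the actual test case query,
which I'm not reproducing here):

  SET enable_hashagg = off;  -- force GroupAggregate + Sort

  SET work_mem = '15GB';     -- still spills; 4 runs left to merge (~63.7s here)
  EXPLAIN (ANALYZE) SELECT a, array_agg(b) FROM t GROUP BY a;

  SET work_mem = '16MB';     -- 960x less memory, yet only ~30% slower (~83.1s)
  EXPLAIN (ANALYZE) SELECT a, array_agg(b) FROM t GROUP BY a;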

Changing the cost of I/O in the context of hash aggregate seems like it
misses the point. Jeff recently said "Overall, the IO pattern is better
for Sort, but not dramatically so". Whatever the IO pattern may be, I
think it's pretty clear that the performance characteristics of hash
aggregation with limited memory are very different from those of group
aggregate + sort, at least when only a fraction of the memory we'd
ideally like is available. It's true that hash aggregate was weird
among plan nodes in v12, and is now in some sense less weird. And yet
we have a new problem now -- so where does that leave the whole
"weirdness" framing? ISTM that the work_mem framework was and is the
main problem. We seem to have lost a crutch that ameliorated that
problem, even though the amelioration was kind of an accident -- or a
thing that user apps evolved to rely on.

--
Peter Geoghegan
