From: | David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel Aggregates for string_agg and array_agg |
Date: | 2018-03-26 22:05:07 |
Message-ID: | CAKJS1f9zUpF4Ntb4=2ba5cQ9YmHptnKUB4tuZZNDmv0OAZ2T4g@mail.gmail.com |
Lists: | pgsql-hackers |
On 27 March 2018 at 09:27, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all. I quote from the fine
> manual:
>
> The aggregate functions array_agg, json_agg, jsonb_agg,
> json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
> similar user-defined aggregate functions, produce meaningfully
> different result values depending on the order of the input
> values. This ordering is unspecified by default, but can be controlled
> by writing an ORDER BY clause within the aggregate call, as shown in
> Section 4.2.7. Alternatively, supplying the input values from a sorted
> subquery will usually work ...
>
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today. Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.
This was not in my list of reasons for not adding them the first time
around. I mentioned these reasons in a response to Stephen.
> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries. Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.
This very much reminds me of something that exists in the 8.4 release notes:
> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)
> Previously, these types of queries always removed duplicate rows by means of Sort/Unique processing (i.e., sort then remove adjacent duplicates). Now they can be implemented by hashing, which will not produce sorted output. If an application relied on the output being in sorted order, the recommended fix is to add an ORDER BY clause. As a short-term workaround, the previous behavior can be restored by disabling enable_hashagg, but that is a very performance-expensive fix. SELECT DISTINCT ON never uses hashing, however, so its behavior is unchanged.
Seems we were happy enough then to tell users to add an ORDER BY.
However, this case is different: in the 8.4 case the results had always
been sorted before the change, whereas here they are only possibly
ordered today. So we'll probably surprise fewer people this time around.
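
To make the ordering concern concrete, here is a minimal Python sketch. It is not PostgreSQL code. The function names, the round-robin split of rows among workers, and the order in which partial results are combined are all assumptions chosen for illustration. It only shows that concatenating per-worker partial results can yield the same elements in a different order than a serial pass over the same input.

```python
def string_agg(values, sep=","):
    # Serial aggregation: concatenate values in the order they arrive.
    return sep.join(values)


def parallel_string_agg(values, n_workers, sep=","):
    # Hypothetical split: rows are dealt round-robin to the workers.
    # A real parallel scan distributes rows differently; this is only
    # an illustrative assumption.
    chunks = [values[i::n_workers] for i in range(n_workers)]
    # Each worker builds a partial result from its own rows.
    partials = [string_agg(chunk, sep) for chunk in chunks if chunk]
    # The combine step concatenates the partial results.
    return sep.join(partials)


rows = ["a", "b", "c", "d", "e", "f"]
serial = string_agg(rows)
parallel = parallel_string_agg(rows, 2)

print(serial)
print(parallel)
```

Both results contain the same six values, but only the serial one preserves the input order. A query without an ORDER BY inside the aggregate call has no guarantee of either ordering.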
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services