Re: Wrong results with grouping sets

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wrong results with grouping sets
Date: 2024-07-04 10:02:26
Message-ID: CAExHW5uHYQhiNoD1VXaEzFhN7c=qiWRZ9HPAeNzkGOZ=W=8e=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 1, 2024 at 1:59 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>
> On Mon, Jun 10, 2024 at 5:05 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> > This patchset does not apply any more. Here is a new rebase.
>
> Here is an updated version of this patchset. I've run pgindent for it,
> and also tweaked the commit messages a bit.
>
> In principle, 0001 can be backpatched to all supported versions to fix
> the cases where there are subqueries in the grouping expressions; 0002
> can be backpatched to 16 where we have the nullingrels stuff. But both
> patches seem to be quite invasive. I'm not sure if we want to backpatch
> them to stable branches. Any thoughts about backpatching?

I don't have any specific thoughts on backpatching, but I have started
reviewing the patches.

The first patch in the set adds a new RTEKind for GROUP. From prologue
of RangeTblEntry structure I can not understand what an RTE represents
especially when the RTE represents something other than a FROM clause
item.
```
* This is because we only need the RTE to deal with SQL features
* like outer joins and join-output-column aliasing.) Other special
* RTE types also exist, as indicated by RTEKind.
```
I can not use this description to decide whether a GROUP BY construct
should have an RTE for itself or not. It looks like the patch adds a
new RTE (kind) here so that its rtindex can be used to differentiate
between a and b from VALUES clause and those from the GroupingSet
result in the query mentioned in the first email in this thread. But I
don't see any discussion of other alternatives. For example, how about
infrastructure in EC to tell which stages this EC is valid for/upto? I
see Tom suggesting use of RTE instead of changing EC but I don't see
why that's better. We do mark a RestrictInfo with relids above which
it can be computed. Similarly we assess validity of EC by stages or
relations being computed. That might open some opportunities for using
broken ECs? We are almost reimplementing parts of the GROUPING set
feature, so may be it's worth spending time thinking about it.

Assuming new RTEkind is the right approach, I am wondering whether
there are other things that should have been represented by RTE for
the same reason. For example, a HAVING clause changes the
characteristics of results by introducing new constraints on the
aggregated results. Should that have an RTE by itself? Will the
RTEKind introduced by this patch cover HAVING clause as well? Will
that open opportunities for more optimizations E.g.
```
explain select sum(a), sum(b), stddev(a + b) from (values (1, 1), (2,
2)) as t(a, b) group by a, b having sum(a) = sum(b) order by 1, 2;
QUERY PLAN
-------------------------------------------------------------------------
Sort (cost=0.10..0.10 rows=1 width=56)
Sort Key: (sum("*VALUES*".column1)), (sum("*VALUES*".column2))
-> HashAggregate (cost=0.06..0.09 rows=1 width=56)
Group Key: "*VALUES*".column1, "*VALUES*".column2
Filter: (sum("*VALUES*".column1) = sum("*VALUES*".column2))
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=8)
(6 rows)
```
Sort Key can be just (sum("*VALUES*".column1)) instead of both
(sum("*VALUES*".column1)), (sum("*VALUES*".column2)) because of HAVING
clause?

Some code level random comments
1.
```
if (rte->rtekind == RTE_GROUP)
{
es->rtable_size--;
break;
```
because of the variable name, it would be interpreted as the size of
es->rtable and will be expected to be the same as
list_length(es->rtable) which it is not. The comment at the member
declaration won't help much for a quick reader. All that variable is
doing is to tell us whether to use alias as prefix or not;
`useprefix = es->rtable_size > 1;` OR useprefix = (es->rtable_size > 1
|| es->verbose);.
Instead of rtable_size, we could let the new member track the fact
whether there are multiple aliases in the query (requiring multiple
prefixes) instead of size of rtable. However, the fact that the GROUP
RTE requires special handling indicates that the new RTEKind doesn't
quite fit the rest of the set. No other RTE, even if outside FROM
clause, required this special handling.

2. expandRecordVariable: The comment below the change should be
updated to explain why an output of GROUPing can not have RECORD or at
least mention GROUPING there.

3. I see code like below in get_eclass_for_sort_expr() and
mark_rels_nulled_by_join()
```
/* ignore GROUP RTE */
if (i == root->group_rtindex)
continue;
```
I assume that rel for this RTE index would be NULL, so "if" block just
below this code would get executed. I think we should just change
Assert() in that code block rather than adding a new "if" block to
avoid confusion.

4. Looking at parse_clause.c most (if not all) addRangeTableEntry*
function calls are from transform* functions. On those lines, I
expected addRangeTableEntryForGroup() to be called from
transformGroupClause(). Why are we calling
addRangeTableEntryForGroup() from parseCheckAggregates()?

5. In the parseCheckAggregates, we are replacing expressions from
targetlist and havingQual with Vars pointing to GROUP RTE. But we are
not doing that to sortClause, the remaining SQL construct. That's
because sortClause is just a list of entries pointing back to
targetlist. So there's nothing to change there. Am I right?

6. I think ParseState::p_grouping_nsitem should be collocated with
other ParseNamespaceItem members or lists in ParseState. I think it
serves a similar purpose as them. Similarly PlannerInfo::group_rtindex
should be placed next to outer_join_rels?

7. Do we need RangeTblEntry::groupexprs as a separate member? They are
the same as GROUP BY or GROUPING SET expressions. So the list can be
constructed from groupClause whenever required. Do we need to maintain
the list separately? I am comparing with other RTEs, say Subquery RTE.
We don't copy all the targetlist expressions from subquery to
subquery's RTE. I noticed that groupexprs are being treated on lines
similar to joinaliasvars. But they are somewhat different. The latter
is a unified representation of columns of joining relations different
from those columns and hence needs a new representation. That doesn't
seem to be the case with RangeTblEntry::groupexpr.

8. The change in process_sublinks_mutator() appears to be related to
the fact that GROUPING() may have subqueries which were not being
handled earlier. That change seems to be independent of the bug being
fixed here. Am I right? If yes, having those changes in a separate
patch will help.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2024-07-04 10:04:27 Re: Recommended books for admin
Previous Message Aleksander Alekseev 2024-07-04 09:47:08 Re: Grammar guidelines in Postgres