Re: Remove useless GROUP BY columns considering unique index

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Zhang Mingli <zmlpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Remove useless GROUP BY columns considering unique index
Date: 2024-12-12 03:09:44
Message-ID: CAApHDvp-vfvkhpkkdcWADwSPNvDZN29Qxrb+x6A1775-thoW0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
> to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
> been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

> NOTES:
> 1. Uniqueness is proved by a UNIQUE index. It might be a bit more
> sophisticated to probe not only grouping qual but also employ
> equivalence classes. In that case, we could process something like that:
>
> CREATE TABLE test (
> x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
> CREATE UNIQUE INDEX ON test(y,z);
> SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
> GROUP BY t1.y,t2.z,t2.w;

It might be worth doing something like that. It looks like we could
delay remove_useless_groupby_columns() until standard_qp_callback
anyway. Further modifications to the GROUP BY can occur there. It
might make sense to replace the call to
make_pathkeys_for_sortclauses_extended() in standard_qp_callback()
with a version of remove_useless_groupby_columns() which does both
tasks plus the one you mention.

However, I don't know what the logic would be exactly for the case you
mention as it seems possible if we start swapping columns out for
another EquivalenceMember that we might cause some regression. For
example, if you had:

create table t1(a int, b int, primary key(a,b);
create table t2(x int, y int);

select ... from t1 inner join t2 on t1.a=t2.x and t1.b = t2.y group by
t2.x,t1.b;

when how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-12-12 03:13:40 Re: pg_createsubscriber TAP test wrapping makes command options hard to read.
Previous Message David Rowley 2024-12-12 02:55:53 Re: Remove useless GROUP BY columns considering unique index