Re: Remove useless GROUP BY columns considering unique index

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: jian he <jian(dot)universality(at)gmail(dot)com>
Cc: Zhang Mingli <zmlpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Remove useless GROUP BY columns considering unique index
Date: 2024-11-27 09:30:14
Message-ID: CAApHDvpbZ2Njs=m-8+rnEr_UjyAyECVH__kjab8SqRcGA0S=Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 27 Nov 2024 at 19:51, jian he <jian(dot)universality(at)gmail(dot)com> wrote:
> v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=8)
Group Key: b, c
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=8)
Group Key: b
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?

I don't really see why primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression index, non-partial indexes where all columns are NOT
NULL. You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.

We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.

The patch could also use a spell check:

"a input" (an input)
"soure" source??
"GROUOP BY"

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2024-11-27 09:41:40 Re: Fix for Extra Parenthesis in pgbench progress message
Previous Message Daniel Gustafsson 2024-11-27 09:16:45 Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE