Re: POC: GROUP BY optimization

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Richard Guo <guofenglinux(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, David Rowley <dgrowleyml(at)gmail(dot)com>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject: Re: POC: GROUP BY optimization
Date: 2024-05-20 08:54:56
Message-ID: CACJufxFoswKvkBi6putzLw67QCYsP+m9d3NLcbhkAj9Di=8eVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 16, 2024 at 3:47 PM Andrei Lepikhov
<a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>
> On 24.04.2024 13:25, jian he wrote:
> > hi.
> > I found an interesting case.
> >
> > CREATE TABLE t1 AS
> > SELECT (i % 10)::numeric AS x,(i % 10)::int8 AS y,'abc' || i % 10 AS
> > z, i::int4 AS w
> > FROM generate_series(1, 100) AS i;
> > CREATE INDEX t1_x_y_idx ON t1 (x, y);
> > ANALYZE t1;
> > SET enable_hashagg = off;
> > SET enable_seqscan = off;
> >
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y;
> > the above part will use:
> > -> Incremental Sort
> > Sort Key: x, $, $, $
> > Presorted Key: x
> > -> Index Scan using t1_x_y_idx on t1
> >
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY z,y,w,x;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY w,y,z,x;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,z,x,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,w,x,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,z,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,w,z;
> >
> > these will use:
> > -> Incremental Sort
> > Sort Key: x, y, $, $
> > Presorted Key: x, y
> > -> Index Scan using t1_x_y_idx on t1
> >
> > I guess this is fine, but not optimal?
> It looks like a bug right now - in current implementation we don't
> differentiate different orders. So:
> 1. Applying all the patches from the thread which I proposed as an
> answer to T.Lane last rebuke - does behavior still the same?.

I've applied these 4 patches in this thread.
final_improvements.patch
minor_comment.patch
get_useful_group_keys_orderings.patch
group_by_asymmetry.diff

The behavior is still the same as the master.
meaning that below quoted queries are still using "Presorted Key: x".

> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y;
> > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y;
> > the above part will use:
> > -> Incremental Sort
> > Sort Key: x, $, $, $
> > Presorted Key: x
> > -> Index Scan using t1_x_y_idx on t1

> 2. Could you try to find the reason?
the following are my understanding, it may be wrong...

function get_useful_group_keys_orderings, may generate alternative
pathkeys and group clauses.
for different PathKeyInfo
sub functions within add_paths_to_grouping_rel:
make_ordered_path, create_agg_path will yield the same cost.
function add_path (within add_paths_to_grouping_rel) will add these
paths (different PathKeyInfos) in a *sequential* order.

then later in the function set_cheapest.
`foreach(p, parent_rel->pathlist)` will iterate these different paths
in a sequential order.
so in set_cheapest if 2 path costs are the same, then the first path
residing in parent_rel->pathlist wins.

fo a concrete example:
CREATE TABLE t1 AS
SELECT (i % 10)::numeric AS x,(i % 10)::int8 AS y,'abc' || i % 10 AS
z, i::float4 AS w FROM generate_series(1, 100) AS i;
CREATE INDEX t1_x_y_idx ON t1 (x, y);
ANALYZE t1;
SET enable_hashagg = off;
SET enable_seqscan = off;
EXPLAIN (COSTS OFF, verbose) SELECT count(*) FROM t1 GROUP BY x,z,y,w;

add_paths_to_grouping_rel will do the following in a sequential order.
A. generate and add original PathKeyInfo(groupby x,z,y,w)
B. generate and add alternative PathKeyInfo(groupby x,y,z,w). (because
of the index path)
C. can_hash is true, generate a HashAgg Path (because of
enable_hashagg is set to off, so this path the cost is very high)

As mentioned previously,
both A and B can use Incremental Sort, set_cheapest will choose the A
instead of B,
because A step generated path **first** satisfies increment sort.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-05-20 08:58:30 Re: Pgoutput not capturing the generated columns
Previous Message Andy Fan 2024-05-20 08:52:07 Re: using extended statistics to improve join estimates