Re: Eager aggregation, take 3

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Paul George <p(dot)a(dot)george19(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Eager aggregation, take 3
Date: 2024-11-06 05:04:38
Message-ID: CACJufxEwZqZcjay_2wEJfCy8fCxUJK_NOknxEpd4M+udkyp_0A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 29, 2024 at 10:26 AM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>
>
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear promising.
> > For example, if the proposed grouping column is already unique, the
> > exercise is pointless (I think). Ideally we'd like to detect that
> > without even creating the grouped_rel. But the proposed grouping
> > column might also be *mostly* unique. For example, consider a table
> > with a million rows and a column 500,000 distinct values. I suspect it
> > will be difficult for partial aggregation to work out to a win in a
> > case like this, because I think that the cost of performing the
> > partial aggregation will not reduce the cost either of the final
> > aggregation or of the intervening join steps by enough to compensate.
> > It would be best to find a way to avoid generating a lot of rels and
> > paths in cases where there's really not much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been built,
> > so that we have good cardinality estimates. Suppose the query joins a
> > single fact table to a series of dimension tables. The final plan thus
> > uses the fact table as the driving table and joins to the dimension
> > tables one by one. Do we really need to consider partial aggregation
> > at every level? Perhaps just where there's been a significant row
> > count reduction since the last time we tried it, but at the next level
> > the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort. I'll work on that next and see what happens.
>

In v13, the latest version, we could short-circuit the case mentioned above where the proposed grouping column is already unique:

/* ... and initialize these targets */
if (!init_grouping_targets(root, rel, target, agg_input,
                           &group_clauses, &group_exprs))
    return NULL;

/*
 * If any proposed grouping column is already covered by a unique index
 * on this base rel, partial aggregation cannot reduce the row count,
 * so don't bother building the grouped rel.
 */
if (rel->reloptkind == RELOPT_BASEREL && group_exprs != NIL)
{
    foreach_node(Var, var, group_exprs)
    {
        if (var->varno == rel->relid &&
            has_unique_index(rel, var->varattno))
            return NULL;
    }
}

This is safe because init_grouping_targets already asserts that
group_exprs is a list of plain Vars.

--------------------------------------------------------------------------------
Also, in create_rel_agg_info we call estimate_num_groups:

result->group_exprs = group_exprs;
result->grouped_rows = estimate_num_groups(root, result->group_exprs,
                                           rel->rows, NULL, NULL);

/*
 * The grouped paths for the given relation are considered useful iff
 * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
 */
agg_info->agg_useful =
    (agg_info->grouped_rows <= rel->rows * (1 - EAGER_AGGREGATE_RATIO));
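
For example, assuming EAGER_AGGREGATE_RATIO were 0.5, a relation with
rel->rows = 1000 would keep its grouped paths only when the estimated
grouped_rows is at most 500.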

If group_exprs contains too many Vars, result->grouped_rows will be less
accurate, and therefore agg_info->agg_useful will be less reliable.
Should we limit the number of Vars in group_exprs?
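
A minimal sketch of such a cap, placed right after the
estimate_num_groups call; the constant MAX_EAGER_AGG_GROUP_EXPRS is
invented here for illustration and is not part of the patch:

/*
 * Hypothetical guard: give up on eager aggregation when the grouping
 * key contains too many expressions, since estimate_num_groups()
 * becomes less reliable as the key widens.  The threshold below is an
 * assumption for illustration only.
 */
#define MAX_EAGER_AGG_GROUP_EXPRS 2

if (list_length(group_exprs) > MAX_EAGER_AGG_GROUP_EXPRS)
    return NULL;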

Here is an example where the grouping key ends up containing multiple Vars:
SET enable_eager_aggregate TO on;
drop table if exists eager_agg_t1,eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2)i;
explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c) FROM eager_agg_t1 t1 JOIN eager_agg_t2 t2 ON
abs(t1.b) = abs(t2.b % 10 + t2.a) group by 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) = (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b % 10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b % 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate = 'on'
 Query Identifier: -734044107933323262
