From: | Richard Guo <guofenglinux(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tender Wang <tndrwang(at)gmail(dot)com>, Paul George <p(dot)a(dot)george19(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Eager aggregation, take 3 |
Date: | 2024-12-17 03:42:28 |
Message-ID: | CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Nov 1, 2024 at 2:54 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> Perhaps we could introduce a GroupPathInfo into the Path structure to
> store common information for a grouped path, such as the location of
> the partial aggregation (which could be the relids of the relation on
> top of which we place the partial aggregation) and the estimated
> rowcount for this grouped path, similar to how ParamPathInfo functions
> for parameterized paths. Then we should be able to compare the
> grouped paths of the same location apples to apples. I haven’t
> thought this through in detail yet, though.
After thinking over this again, I think one difference from the
parameterized path case is that, for a parameterized path, the fewer
the required outer rels, the better, as more outer rels imply more
join restrictions. Therefore, the number of required outer rels
serves as a criterion when comparing paths in add_path().
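
As a rough illustration of that criterion, the sketch below models the set of required outer rels as a plain bitmask and reports whether one set is contained in the other. The names toy_subset_compare and ToySubsetResult are made up for this example; the real comparison in add_path() works on Bitmapsets via bms_subset_compare().

```c
#include <stdint.h>

/* Hypothetical result codes for comparing two sets of required outer rels. */
typedef enum
{
    TOY_EQUAL,      /* both paths require the same outer rels */
    TOY_SUBSET1,    /* first set is a proper subset of the second */
    TOY_SUBSET2,    /* second set is a proper subset of the first */
    TOY_DIFFERENT   /* neither set contains the other */
} ToySubsetResult;

/*
 * Compare two relid sets, each encoded as a bitmask (bit i set means
 * relation i is a required outer rel).  A path whose set is a subset of
 * the other's imposes fewer join restrictions.
 */
static ToySubsetResult
toy_subset_compare(uint32_t a, uint32_t b)
{
    if (a == b)
        return TOY_EQUAL;
    if ((a & b) == a)
        return TOY_SUBSET1;
    if ((a & b) == b)
        return TOY_SUBSET2;
    return TOY_DIFFERENT;
}
```

For example, toy_subset_compare(0x1, 0x3) yields TOY_SUBSET1, so on this criterion alone the first path would be at least as good as the second. Nothing comparable is needed for the location of a partial aggregation.
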
For a grouped path, however, we don't concern ourselves with the
location of the partial aggregation. What matters is whether one
grouped path is preferable to another based on the current merits of
add_path(). Therefore, I think it's acceptable to compare grouped
paths for the same grouped rel, regardless of where the partial
aggregation is placed.
Note that non-grouped and grouped paths will not appear in the same
RelOptInfo. All paths for a grouped rel are grouped paths, meaning
there is a partial aggregation node somewhere in the path tree.
Similarly, all paths for a non-grouped rel are non-grouped paths.
That is to say, it is not possible to compare a grouped path with a
non-grouped path.
Two different grouped paths for the same grouped rel can have very
different rowcount estimates, depending on where the partial
aggregation is placed in the path tree. Therefore, for a grouped
join path, we have to calculate its rowcount estimate using its outer
and inner paths, as we do in set_joinpath_size(). This is
similar to what we do for parameterized paths: two different
parameterized paths for the same rel can also have very different
rowcount estimates, depending on which outer rels supply the
parameters. So we calculate the rowcount estimates for parameterized
join paths for each different parameterization in
get_parameterized_joinrel_size().
set_joinpath_size() adds a special case to final_cost_nestloop(),
final_cost_mergejoin(), and final_cost_hashjoin(). For non-grouped
paths, the only extra cost is one additional check,
IS_GROUPED_REL(rel), which is defined as
#define IS_GROUPED_REL(rel) ((rel)->agg_info != NULL)
I doubt that this additional simple pointer check will cause general
performance regressions.
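
To show the shape of that special case, here is a minimal sketch using toy stand-ins for RelOptInfo and Path. The ToyRel and ToyPath structs, the toy_joinpath_rows function, and the simple outer * inner * selectivity formula are all assumptions made for illustration; only the IS_GROUPED_REL macro is taken from the patch as quoted above.

```c
#include <stddef.h>

/* Toy stand-ins; the real RelOptInfo and Path carry far more fields. */
typedef struct ToyRel
{
    void   *agg_info;   /* non-NULL only for a grouped rel */
    double  rows;       /* rel-level row estimate */
} ToyRel;

typedef struct ToyPath
{
    double  rows;       /* path-level row estimate */
} ToyPath;

#define IS_GROUPED_REL(rel) ((rel)->agg_info != NULL)

/*
 * Row estimate for a join path.  A non-grouped rel pays only for the
 * pointer check and reuses the rel-level estimate.  A grouped rel derives
 * the estimate from the specific outer and inner paths, because their row
 * counts depend on where the partial aggregation sits in the path tree.
 */
static double
toy_joinpath_rows(const ToyRel *joinrel,
                  const ToyPath *outer,
                  const ToyPath *inner,
                  double selectivity)
{
    double  rows;

    if (!IS_GROUPED_REL(joinrel))
        return joinrel->rows;

    /* Simplified inner-join estimate, clamped to at least one row. */
    rows = outer->rows * inner->rows * selectivity;
    return (rows < 1.0) ? 1.0 : rows;
}
```

With the same grouped join rel and the same inner path, an outer path that has already been aggregated down to 10 rows produces a much smaller estimate than one still carrying 100 rows, which is why the estimate cannot be stored once per rel.
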
> Yeah, this patch does not get it correct here. Basically the logic is
> that for the partial aggregation pushed down to a non-aggregated
> relation, we need to consider all columns of that relation involved in
> upper join clauses and include them in the grouping keys. Currently,
> the patch only checks whether a column is involved in upper join
> clauses but does not verify how the column is used. We need to ensure
> that the operator used in the join clause is at least compatible with
> the grouping operator; otherwise, the grouping operator might
> interpret the values as the same while the join operator sees them as
> different.
Hmm, I think we can prevent this issue from occurring if we ensure
that "equality implies image equality" for each grouping key used in
partial aggregation. In such cases, if the grouping operator in
partial aggregation treats two values as equal, the join operator in
the upper join clause must also treat them as equal.
On the other hand, it’s possible that the grouping operator treats two
values as different, while the join operator treats them as equal.
This is fine, as the different partial groups will be combined during
the final aggregation.
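
A small standalone example may help show why plain equality is not enough. It uses ordinary C doubles rather than any PostgreSQL datatype, and the helper names toy_equal and toy_image_equal are invented for this sketch. Positive and negative zero compare as equal, yet their byte images differ, so a type with this behavior does not satisfy "equality implies image equality".

```c
#include <stdbool.h>
#include <string.h>

/* Equality as an ordinary "=" operator would see it. */
static bool
toy_equal(double a, double b)
{
    return a == b;
}

/* Image equality: the two values are byte-for-byte identical. */
static bool
toy_image_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```

Here toy_equal(0.0, -0.0) is true while toy_image_equal(0.0, -0.0) is false. If a grouping key behaved like this, partial aggregation could fold both values into one group and keep only one representative, and an upper join operator that distinguishes them would then see the wrong value for some of the aggregated rows.
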
Attached is the patch rebased on the latest master. It refines the
theoretical justification for the correctness of this transformation
in the README and the commit message. It also adds the check for image
equality for all grouping keys used in partial aggregation, and fixes
the issue reported by Jian. It does not yet handle the RLS case,
though.
Thanks
Richard
Attachment | Content-Type | Size |
---|---|---|
v14-0001-Implement-Eager-Aggregation.patch | application/octet-stream | 175.4 KB |