Re: Eager aggregation, take 3

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, Paul George <p(dot)a(dot)george19(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Eager aggregation, take 3
Date: 2025-01-20 17:57:20
Message-ID: 3016309.1737395840@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> So I don't quite know which way to jump here. It now seems to me that
> we have three similar features with three different designs.
> Parameterization added non-comparable paths to the same path list;
> parallel query added them to a different path list in the same
> RelOptInfo; and this patch currently adds them a separate RelOptInfo.

Yeah, this. I don't think that either of those first two decisions
was wrong, but it does seem annoying that this patch wants to do it
yet a third way. Still, it may be the right thing. Bear with me a
moment:

We dealt with parameterized paths being in the same list as
non-parameterized paths by treating the set of parameter rels as a
figure-of-merit that add_path can compare. This works because if,
say, a nonparameterized path dominates a parameterized one on every
other figure of merit then there's no point in keeping the
parameterized one. It is squirrely that the parameterized paths
typically don't yield the same number of rows as others for the same
RelOptInfo, but at least so far that hasn't broken anything. I think
it's important that the parameterized paths do yield the same column
set as other paths for the rel; and the rows they do yield will be a
subset of the rows that nonparameterized paths yield.

On the other hand, it's not sensible for partial paths to compete
in an add_path tournament with non-partial ones. If they did, neither
group could be allowed to dominate the other group, so add_path would
just be wasting its time making those path comparisons. So I do think
it was right to put them in a separate path list. Importantly, they
generate the same column set and some subset of the same rows that
the non-partial ones do, which I think is what justifies putting
them into the same RelOptInfo.

However, a partial-aggregation path does not generate the same data
as an unaggregated path, no matter how fuzzy you are willing to be
about the concept. So I'm having a very hard time accepting that
it ought to be part of the same RelOptInfo, and thus I don't really
buy that annotating paths with a GroupPathInfo is the way forward.

What this line of analysis doesn't tell us though is whether paths
that did their partial aggregations at different join levels can be
considered as enough alike that they should compete on cost terms.
If they are, we need to put them into the same RelOptInfo. So
while I want to have separate RelOptInfos for partially aggregated
paths, I'm unclear on how many of those we need or what their
identifying property is.

Also: we avoid generating parameterized partial paths, because
combining those things would be too much of a mess. There's some
handwaving in the comments for add_partial_path to the effect that
it wouldn't be a win anyway, but I think the real reason is that
it'd be far too complicated for the potential value. Can we make
a similar argument for partial aggregation? I sure hope so.

> I agree that creating an exponential number of RelOptInfos is not
> going to work out well.

FWIW, I'm way more concerned about the number of Paths considered
than I am about the number of RelOptInfos. This relates to your
question about whether we want to use some heuristics to limit
the planner's search space.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2025-01-20 18:26:13 Re: Bug in detaching a partition with a foreign key.
Previous Message Jacob Champion 2025-01-20 17:53:38 Re: Log connection establishment timings