From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, Paul George <p(dot)a(dot)george19(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Eager aggregation, take 3
Date: 2025-01-17 12:19:43
Message-ID: CAMbWs48HRvZTsSyLavkYXuo3otqnNQ7jdBO9iR-05SQirSiKSg@mail.gmail.com
Lists: pgsql-hackers
On Fri, Jan 17, 2025 at 6:40 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> * The README addition, and the basically identical text in the
> commit message, don't even provide a reason to believe that the
> transformation is correct let alone that it will result in faster
> execution. I don't understand why it's so hard to provide a solid
> correctness argument. This work was supposedly based on an academic
> paper; surely that paper must have included a correctness proof?
> PG might need a few refinements, like being specific about what we
> expect from the equality operators. But an EXPLAIN plan is not an
> argument.
Thank you for taking a look at this patch!
In the README, I provided the justification for the correctness of
this transformation as follows:
For the partial aggregation that is pushed down to a non-aggregated
relation, we need to consider all expressions from this relation that
are involved in upper join clauses and include them in the grouping
keys, using compatible operators. This is essential to ensure that an
aggregated row from the partial aggregation matches the other side of
the join if and only if each row in the partial group does. This
ensures that all rows within the same partial group share the same
'destiny', which is crucial for maintaining correctness.
I believed that this explanation would make it clear why this
transformation is correct.
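To make that requirement concrete, here is a hand-written sketch of
the pushed-down partial aggregation, assuming tables a (i, x) and
b (j, y) joined on a.i = b.j with GROUP BY a.i, as in the plan quoted
below. (The planner builds this as a partial-aggregate path
internally; the SQL form is only for illustration.)

-- b.j appears in the upper join clause a.i = b.j, so it must be
-- included in the grouping keys of the partial aggregation pushed
-- down to b.  Every row collapsed into one partial group then joins
-- to exactly the same a rows, i.e. all rows in a group share the
-- same "destiny".
select b.j, count(b.y) as partial_count, sum(b.y) as partial_sum
from b
group by b.j;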
Yeah, this work implements one of the transformations introduced in
the paper "Eager Aggregation and Lazy Aggregation". In the paper,
Section 4 presents the formalism, Section 5 proves the main theorem,
and Section 6 introduces corollaries related to this specific
transformation. I'm just not sure how to translate those theorems and
corollaries into natural language suitable for the README. I can give
it another try if you find the justification above unclear, but it
would be really helpful if I could get some assistance with this.
And I'd like to clarify that the EXPLAIN plan included in the README
is only meant to illustrate what this transformation looks like; it is
not intended to serve as an argument for its correctness.
> * As for the performance aspect, we're given
>
> Finalize HashAggregate
>   Group Key: a.i
>   ->  Nested Loop
>         ->  Partial HashAggregate
>               Group Key: b.j
>               ->  Seq Scan on b
>         ->  Index Only Scan using a_pkey on a
>               Index Cond: (i = b.j)
>
> As far as I can see, this will require aggregation to be performed
> across every row of "b", whereas the naive way would have aggregated
> across only rows having join partners in "a".
Yes, that's correct.
> If most "b" rows lack
> a join partner then this will be far slower than the naive way.
No, this is not correct. The partial aggregation may reduce the
number of input rows to the join, and the resulting data reduction
could justify the cost of performing the partial aggregation. As an
example, please consider:
create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);
insert into t1 select i%3, i%3, i from generate_series(1,1000000)i;
insert into t2 select i%3+3, i%3+3, i from generate_series(1,1000000)i;
analyze t1, t2;
explain analyze
select sum(t2.c) from t1 join t2 on t1.b = t2.b group by t1.a;
So for this query, most (actually all) t2 rows lack a join partner.
Running it with and without eager aggregation, I got (best of 3):
-- with eager aggregation
Execution Time: 496.856 ms
-- without eager aggregation
Execution Time: 1723.844 ms
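For intuition, the transformation on this query can also be written
out by hand. This is only a sketch: the planner carries partial
aggregate states internally rather than rewriting the SQL, but for
sum() the rewrite below computes the same result:

select sum(p.psum)
from t1
join (select b, sum(c) as psum
      from t2
      group by b) p
  on t1.b = p.b
group by t1.a;

The derived subquery reduces the one million t2 rows to just three
rows fed into the join, which is where the speedup comes from.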
> I do see that it can be better if most "b" rows have multiple join
> partners, because we'll re-use partial aggregation results instead
> of (effectively) recalculating them.
Not only because we'll re-use partial aggregation results, but also
(and perhaps more importantly) because the number of input rows to the
join could be significantly reduced.
> But the README text makes it
> sound like this is an unconditional win, which is not the right
> mindset.
I'm sorry if the README text gives that impression. The README says:
If the partial aggregation on table B significantly reduces the number
of input rows, the join above will be much cheaper, leading to a more
efficient final plan.
Perhaps I should use "could" or "might" instead of "will" to make it
less misleading.
But as you can see from the implementation, the decision is entirely
based on cost, not on rules. There is no part of the code that ever
assumes this transformation is an unconditional win.
> (In fact, in this specific example where a.i is presumed
> unique, how's it a win at all?)
It seems to me that whether it's a win depends on whether b.j is a
column with low cardinality (i.e., relatively few unique values). I
don't really see how a.i being unique would change that. Please
see the example below:
create table a (i int primary key, x int);
create table b (j int, y int);
insert into a select i, i%3 from generate_series(1,10000)i;
insert into b select i%3, i from generate_series(1,10000)i;
analyze a, b;
set enable_eager_aggregate to off;
EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y)
FROM a JOIN b ON a.i > b.j
GROUP BY a.i;
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
 HashAggregate (actual time=100257.254..100268.841 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=2.629..40849.630 rows=99990000 loops=1)
         Buffers: shared hit=133
         ->  Seq Scan on b (actual time=0.450..10.066 rows=10000 loops=1)
               Buffers: shared hit=45
         ->  Memoize (actual time=0.002..0.752 rows=9999 loops=10000)
               Cache Key: b.j
               Cache Mode: binary
               Hits: 9997  Misses: 3  Evictions: 0  Overflows: 0  Memory Usage: 1055kB
               Buffers: shared hit=88
               ->  Index Only Scan using a_pkey on a (actual time=0.752..8.100 rows=9999 loops=3)
                     Index Cond: (i > b.j)
                     Heap Fetches: 0
                     Buffers: shared hit=88
 Planning Time: 1.681 ms
 Execution Time: 100273.011 ms
(19 rows)
set enable_eager_aggregate to on;
EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y)
FROM a JOIN b ON a.i > b.j
GROUP BY a.i;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize HashAggregate (actual time=77.701..90.680 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=27.586..52.352 rows=29997 loops=1)
         Buffers: shared hit=133
         ->  Partial HashAggregate (actual time=27.408..27.419 rows=3 loops=1)
               Group Key: b.j
               Batches: 1  Memory Usage: 24kB
               Buffers: shared hit=45
               ->  Seq Scan on b (actual time=0.173..3.767 rows=10000 loops=1)
                     Buffers: shared hit=45
         ->  Index Only Scan using a_pkey on a (actual time=0.108..5.277 rows=9999 loops=3)
               Index Cond: (i > b.j)
               Heap Fetches: 0
               Buffers: shared hit=88
 Planning Time: 1.739 ms
 Execution Time: 93.003 ms
(18 rows)
There is a performance improvement of ~1000 times, even though a.i is
unique.
# select 100273.011/93.003;
       ?column?
-----------------------
 1078.1696396890422890
(1 row)
(I used 'a.i > b.j' instead of 'a.i = b.j' to make the performance
difference more noticeable. I believe this is fine, as it doesn't
undermine the fact that a.i is unique.)
> * I'm also concerned about what happens with aggregates that can have
> large partial-aggregation values, such as string_agg(). With the
> existing usage of partial aggregation for parallel queries, it's
> possible to be confident that there are not many partial-aggregation
> values in existence at the same time. I don't think that holds for
> pushed-down aggregates: for example, I wouldn't be surprised if the
> planner chooses a join plan that requires stuffing all those values
> into a hash table, or "materializes" the output of the partial
> aggregation step. Do we have logic that will avoid blowing out
> memory during such queries?
Good point! Thank you for bringing this up. I hadn't considered it
before, and it seems no one else has raised this issue. I'll look
into it.
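For example, a query shape that would exercise this concern might
look like the following (a hypothetical illustration using the a/b
tables from above; I haven't verified how the current patch behaves
here):

-- If the partial aggregation is pushed down to b and grouped on b.j,
-- each group's partial state is a growing string.  A hash join or a
-- Materialize node above the partial aggregation could then hold all
-- of these large states in memory at the same time.
select a.i, string_agg(b.y::text, ',')
from a join b on a.i = b.j
group by a.i;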
> * I am just as worried as Robert is about the notion of different
> paths for the same RelOptInfo having different rowcount estimates.
> That is an extremely fundamental violation of basic planner
> assumptions. We did bend it for parameterized paths by restating
> those assumptions as (from optimizer/README):
>
> To keep cost estimation rules relatively simple, we make an implementation
> restriction that all paths for a given relation of the same parameterization
> (i.e., the same set of outer relations supplying parameters) must have the
> same rowcount estimate. This is justified by insisting that each such path
> apply *all* join clauses that are available with the named outer relations.
>
> I don't see any corresponding statement here, and it's not clear
> to me that the point has been thought through adequately.
>
> Another aspect that bothers me is that a RelOptInfo is understood
> to contain a bunch of paths that all yield the same data (the same
> set of columns), and it seems like that might not be the case here.
> Certainly partially-aggregated paths will output something different
> than unaggregated ones, but mightn't different join orders mutate the
> column set even further?
>
> I think that we might be better off building a separate RelOptInfo for
> each way of pushing down the aggregates, in order to preserve the
> principle that all the paths in any one RelOptInfo have the same
> output. This'll mean more RelOptInfos, but not more paths, so
> I doubt it adds that much performance overhead.
Hmm, IIUC, this means that we would separate the grouped paths of the
same grouped relation into different RelOptInfos based on the location
of the partial aggregation within the path tree. Let's define the
"location" as the relids of the relation on top of which we place the
partial aggregation. For grouped relation {A B C D}, if we perform
some aggregation on C, we would end up with 8 different grouped paths:
{A B D PartialAgg(C)}
{B D PartialAgg(A C)}
{A D PartialAgg(B C)}
{A B PartialAgg(D C)}
{D PartialAgg(A B C)}
{B PartialAgg(A D C)}
{A PartialAgg(B D C)}
{PartialAgg(A B D C)}
That means we would need to create 8 RelOptInfos for this grouped
relation. If my math doesn't fail me, for a relation containing n
base rels, we would need to create 2^(n-1) different RelOptInfos,
since each of the other n-1 base rels can independently sit either
above or below the partial aggregation.
When building grouped relation {A B C D E} by joining {A B C D} with
{E}, we would need to call make_grouped_join_rel() 8 times, each time
joining {E} with one of the 8 RelOptInfos mentioned above. After also
considering other join orders, such as joining {A B C E} with {D},
this new grouped relation would end up with 16 new RelOptInfos.
Proceeding to grouped relation {A B C D E F}, we end up with 32 new
RelOptInfos, and this process continues...
It seems to me that this doesn't just result in more RelOptInfos; it
can also lead to more paths. Consider two grouped paths, say P1 and
P2, for the same grouped relation, but with different locations of the
partial aggregation. Suppose P1 is cheaper, at least as well ordered,
generates no more rows, requires no outer rels not required by P2, and
is no less parallel-safe. If these two paths are kept in the same
RelOptInfo, P2 will be discarded and not considered in further
planning. However, if P1 and P2 are separated into different
RelOptInfos, and P2 happens to have survived the add_path() tournament
for the RelOptInfo it is in, then it will be considered in subsequent
planning steps.
So in any case, this doesn't seem like a feasible approach to me.
I also have some thoughts on grouped paths and parameterized paths,
but I've run out of time for today. I'll send a separate email.
I'm really glad you're taking a look at this patch. Thank you!
Thanks
Richard