Re: Eager aggregation, take 3

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, Paul George <p(dot)a(dot)george19(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Eager aggregation, take 3
Date: 2024-11-01 06:21:06
Message-ID: CAMbWs488eecPb0qk9c9k6RQNYtnHcm4YVQch+ppmf5jpZHGH6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 31, 2024 at 12:25 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Well, the key thing here is the reasoning, which you don't really
> spell out either here or in the patch. The patch's justification for
> the use of BTEQUALIMAGE_PROC Is that, if we use non-equal-image
> operators, "we may lose some information that could be needed to
> evaluate upper qual clauses." But there's no explanation of why that
> would happen, and I think that makes it difficult to have a good
> discussion.
>
> Here's an example from the optimizer README:
>
> # 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
> # = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
> #
> # Identity 3 only holds if predicate Pbc must fail for all-null B rows
> # (that is, Pbc is strict for at least one column of B). If Pbc is not
> # strict, the first form might produce some rows with nonnull C columns
> # where the second form would make those entries null.
>
> This isn't the clearest justification for a rule that I've ever read,
> but it's something. Someone reading this can think about whether the
> two sentences of justification are a correct argument that Pbc must be
> strict in order for the identity to hold.
>
> I think you should be trying to offer similar (or better)
> justifications, and perhaps similar identities as well. It's a little
> bit hard to think about how to write the identities for this patch,
> although you could start with aggregate(X) =
> finalizeagg(partialagg(X)) and then explain how the partialagg can be
> commuted with certain joins and what is required for it to do so. The
> argument needs to be detailed enough that we can evaluate whether it's
> correct or not.
>
> Personally, I'm still stuck on the point I wrote about yesterday. When
> you partially aggregate a set of rows, the resulting row serves as a
> proxy for the original set of rows. So, all of those rows must have
> the same destiny. As I mentioned then, if you ever partially aggregate
> a row that should ultimately be filtered out with one that shouldn't,
> that breaks everything. But also, I need all the rows that feed into
> each partial group to be part of the same set of output rows. For
> instance, suppose I run a report like this:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id GROUP BY 1;
>
> If I'm thinking of executing this as FinalizeAgg(order JOIN
> PartialAgg(order_lines)), what must be true in order for that to be a
> valid execution plan? I certainly can't aggregate on some unrelated
> column, say part_number. If I do, then first of all I don't know how
> to get an order_id column out so that I can still do the join. But
> even if that were no issue, you might have two rows with different
> values of the order_id column and the same value for part_number, and
> that the partial groups that you've created on the order_lines table
> don't line up properly with the groups that you need to create on the
> orders table. Some particular order_id might need some of the rows
> that went into a certain part_number group and not others; that's no
> good.
>
> After some thought, I think the process of deduction goes something
> like this. We ultimately need to aggregate on a column in the orders
> table, but we want to partially aggregate the order_lines table. So we
> need a set of columns in the order_lines table that can act as a proxy
> for o.order_number. Our proxy should be all the columns that appear in
> the join clauses between orders and order_lines. That way, all the
> rows in a single partially aggregate row will have the "same" order_id
> according to the operator class implementing the = operator, so for a
> given row in the "orders" table, either every row in the group has
> o.order_id = l.order_id or none of them do; hence they're all part of
> the same set of output rows.
>
> It doesn't matter, for example, whether o.order_number is unique. If
> it isn't, then we'll flatten together some different rows from the
> orders table -- but each of those rows will match all the rows in
> partial groupings where o.order_id = partially_grouped_l.order_id and
> none of the rows where that isn't the case, so the optimization is
> still valid. Likewise it doesn't matter whether o.order_id is unique:
> if order_id = 1 occurs twice, then both of the rows will match the
> partially aggregated group with order_id = 1, but that's fine. The
> only thing that's a problem is if the same row from orders was going
> to match some but not all of the partially aggregate rows from some
> order_lines group, and that won't happen here.
>
> Now consider this example:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id AND o.something < l.something GROUP BY
> 1;
>
> Here, we cannot partially group order_lines on just l.order_id,
> because we might have a row in the orders table with order_id = 1 and
> something = 1 and two rows in the order_lines table that both have
> order_id = 1 but one has something = 0 and the other has something =
> 2. The orders row needs to join to one of those but not the other, so
> we can't put them in the same partial group. However, it seems like it
> would be legal to group order_lines on (order_id, something), provided
> that the operator classes we use for the grouping operation matches
> the operator classes of the operators in the join clause - i.e. we
> group on order_id using the operator class of = and on something using
> the operator class of <. If the operator classes don't match in this
> way, then it could happen that the grouping operator thinks the values
> are the same but the join operator thinks they're different.
> (Everything is also OK if the grouping operator tests
> bitwise-equality, because then nothing can ever get merged that
> shouldn't; but other notions of equality are also fine as long as
> they're compatible with the operator actually used.)
>
> Now let's consider this example, using an imaginary user-defined operator:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l
> WHERE o.order_id = l.order_id AND o.something ?!?! l.something GROUP
> BY 1;
>
> Here, we can partially aggregate order_lines by (order_id, something)
> as long as order_id is grouped using bitwise equality OR the same
> operator class as the = operator used in the query, and something has
> to use bitwise equality.
>
> What about this:
>
> SELECT o.order_number, SUM(l.quantity) FROM orders o LEFT JOIN
> order_lines l ON o.order_id = l.order_id AND l.something = 1 GROUP BY
> 1;
>
> It's really important that we don't try to aggregate on just
> l.order_id here. Some rows in the group might have l.something = 1 and
> others not. It would be legal (but probably not efficient) to
> aggregate order_lines on (order_id, something).
>
> So it seems to me that the general rule here is that when the table
> for which we need a surrogate key is directly joined to the table
> where the original grouping column is located, we need to group on all
> columns involved in the join clause, using either compatible operators
> or bitwise equality operators. As a practical matter, we probably only
> want to do the optimization when all of the join clauses are
> equijoins. Then we don't need to worry about bitwise equality at all,
> AFAICS; we just group using the operator class that contains the
> operator specified by the user.
>
> What if there are more than two tables involved, like this:
>
> SELECT o.order_number, MAX(n.creation_time) FROM orders o, order_lines
> l, order_line_notes n WHERE o.order_id = l.order_id AND o.note_id =
> n.note_id GROUP BY 1;
>
> Here, there's no direct connection between the table with the original
> grouping column and the table we want to aggregate. Using the rule
> mentioned above, we can deduce that l.order_id is a valid grouping
> column for order_lines. Applying the same rule again, we can then
> deduce that n.note_id is a valid grouping column for note_id. We could
> either group note_id on n and then perform the remaining joins; or we
> could join order_lines and note_id and then partially aggregate the
> result on l.order_id.
>
> What if there are multiple grouping columns, like this:
>
> SELECT foo.x, bar.y, SUM(baz.z) FROM foo, bar, baz WHERE foo.a = baz.a
> AND bar.b = baz.b GROUP BY 1, 2;
>
> In this case, we need to find a surrogate grouping column for foo.x
> and also a surrogate grouping column for bar.y. We can group baz on a,
> b; but not just on a or just on b.
>
> Finally, I think this example is interesting:
>
> SELECT p.title, SUM(c.stuff) FROM parent p, link l, child c WHERE p.x
> = l.x AND l.y = c.y AND p.z = c.z GROUP BY 1;
>
> Based on the rule that I articulated earlier, you might think we can
> just look at the join clauses between parent and child and group the
> child on c.z. However, that's not correct -- we'd have to group on
> both c.x and c.z. I'm not sure how to adjust the rule so that it
> produces the right answer here.

Thank you for the thorough deduction process; this is something I
should have done before proposing the patch. As we discussed
off-list, what I need to do next is to establish a theory that proves
the transformation proposed in this patch is correct in all cases.

What I have in mind now is that to push a partial aggregation down to
a relation, we need to group by all the columns of that relation
involved in the upper join clauses, using compatible operators. This
is essential to ensure that an aggregated row from the partial
aggregation matches the other side of the join if and only if each row
in the partial group does, thereby ensuring that all rows in the same
partial group have the same 'destiny'.

Thanks
Richard

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2024-11-01 06:27:38 Re: define pg_structiszero(addr, s, r)
Previous Message Richard Guo 2024-11-01 05:54:28 Re: Eager aggregation, take 3