Re: POC: GROUP BY optimization

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2018-06-06 23:06:19
Message-ID: 24614373-916b-a911-834b-dc7cb1b447a3@2ndquadrant.com
Lists: pgsql-hackers


On 06/07/2018 12:18 AM, Claudio Freire wrote:
> On Wed, Jun 6, 2018 at 6:57 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On 06/06/2018 11:22 PM, Claudio Freire wrote:
>>> On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
>>> As such, estimating sort performance correctly in the various plan
>>> variants being considered seems to be a very central aspect of it.
>>>
>>> This means it's pretty much time to consider the effect of operator
>>> cost *and* distinct values in the cost computation.
>>>
>>
>> Yes, until now that was not really needed because the optimizer does not
>> consider reordering of the pathkeys - it simply grabs either GROUP BY or
>> ORDER BY pathkeys and runs with that.
>>
>> So the costing was fairly trivial, we simply do something like
>>
>> comparison_cost = 2.0 * cpu_operator_cost;
>>
>> sort_cost = comparison_cost * tuples * LOG2(tuples);
>>
>> which essentially ignores that there might be multiple columns, or that
>> the columns may have sort operators with different costs.
>>
>> The question is how reliable the heuristics can be. The current patch
>> uses just plain ndistinct, which seems rather unreliable, but I don't
>> have a clear idea how to improve that - we may have MCV for the columns
>> and perhaps some extended statistics, but I'm not sure how far we can
>> run with that.
>>
>> Essentially, what we need is to estimate the number of comparisons for
>> each column, to compute a better comparison_cost.
>
> This could be approached by adjusting statistics by any relevant
> restrictions applicable to the columns being grouped, as is done for
> selectivity estimations.
>

I don't follow - how is this related to restrictions? Can you elaborate?

> I'm not sure how far that would get us, though. It would be rather
> easy to lose sight of those restrictions if there are complex
> operations involved.
>
>>> Assuming cost_sort does its thing, it's just a matter of building the
>>> desired variants and picking the one with the smallest cost_sort. One
>>> can use heuristics to build a subset of all possible column orderings
>>> with a guiding principle that tries to maximize the likelihood of
>>> finding a better order than the one in the query (like sorting by
>>> ndistinct), but I'd suggest:
>>>
>>> - Always considering the sort order verbatim as given in the SQL (ie:
>>> what the user requests)
>>> - Relying on cost_sort to distinguish among them, and pick a winner, and
>>> - When in doubt (if cost_sort doesn't produce a clear winner), use the
>>> order given by the user
>>>
>>
>> I don't see why not to generate all possible orderings (keeping only the
>> cheapest path for each pathkey variant) and let the optimizer sort it
>> out.
>
> I'm assuming costing the full N! possible orderings would be
> prohibitively expensive.
>

That's possible, yes - particularly for large values of N. Perhaps there's
a simpler algorithm that computes a sufficient approximation in a greedy
way, starting from some likely-good combination of columns. I haven't
thought about this part very much ...
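
Just to illustrate what I mean by greedy (a rough sketch only - the
scoring function below is entirely made up, and it assumes uniformity
and independence of the columns, which is exactly what we can't rely
on): at each position pick the not-yet-used column minimizing the cost
of comparing it now plus a crude one-step estimate of the comparisons
left for the remaining columns. That's O(N^2) candidates instead of N!.

#include <stdio.h>
#include <stdbool.h>

typedef struct
{
    const char *name;
    double      ndistinct;  /* per-column ndistinct estimate */
    double      op_cost;    /* cost of one comparison on this column */
} GroupCol;

/* greedy ordering sketch; assumes ncols <= 32 */
static void
greedy_order(const GroupCol *cols, int ncols, int *order)
{
    bool        used[32] = {false};
    double      prefix_groups = 1.0;    /* ndistinct of columns chosen so far */

    for (int pos = 0; pos < ncols; pos++)
    {
        int         best = -1;
        double      best_score = 0.0;

        for (int i = 0; i < ncols; i++)
        {
            double      rest = 0.0;
            double      score;

            if (used[i])
                continue;

            /* total op cost of the columns that would remain after i */
            for (int j = 0; j < ncols; j++)
                if (!used[j] && j != i)
                    rest += cols[j].op_cost;

            /*
             * Column i is compared for ~1/prefix_groups of the tuples;
             * the remaining columns then hide behind a prefix that also
             * includes i's ndistinct.
             */
            score = cols[i].op_cost / prefix_groups +
                rest / (prefix_groups * cols[i].ndistinct);

            if (best < 0 || score < best_score)
            {
                best = i;
                best_score = score;
            }
        }

        order[pos] = best;
        used[best] = true;
        prefix_groups *= cols[best].ndistinct;
    }
}

int
main(void)
{
    GroupCol    cols[] = {
        {"a", 10.0, 2.0},
        {"b", 1000.0, 2.0},
        {"c", 5.0, 50.0},       /* expensive comparison, e.g. long text */
        {"d", 100.0, 2.0}
    };
    int         order[4];

    greedy_order(cols, 4, order);

    for (int i = 0; i < 4; i++)
        printf("%s ", cols[order[i]].name);
    printf("\n");
    return 0;
}

With these made-up ndistinct and operator costs it picks the cheap
high-ndistinct columns first (b, d, a) and pushes the expensive
comparison (c) to the end, which is roughly what we'd want. Whether such
a simple score is good enough is of course the open question.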

>> If the user specified an explicit ORDER BY, we'll slap an extra
>> Sort node at the end, increasing the cost.
>
> I think you misunderstood me. I'm saying the exact opposite.
>
> When I mention handicap, I mean to give the explicitly requested group
> by order a *reduced* cost, to give it an advantage over the
> heuristics.
>

I did understand you. I'm saying that if the ordering does not match the
one requested by the user (in ORDER BY), we end up adding an explicit
Sort node on top of it, which increases the cost of all the other paths,
acting as a disadvantage.

But now I realize this only works when there is an ORDER BY clause,
not for GROUP BY (in which case we don't add any Sort node).

> This is to try to attack the problem you mentioned where users already
> accounting for operator costs in their queries would get overridden by
> the planner, perhaps to the detriment of overall execution time.
>
> In essence:
>
> - If the user requested that order, we assume it is "somewhat
> subjectively better" (by giving it a slightly reduced cost).
> - If there is an explicit order by clause that differs from a
> considered path, the required sort node will already penalize it
> appropriately, nothing needs to be done in relation to sort costs.
> - All other things being equal, cost_sort will make the decision. If a
> plan beats the user-provided order in spite of the handicap, it will
> be used. So it's still optimizing clear cases.
>

OK. I still don't believe this is a good approach, because we don't know
how good our estimate of the costs is, so I have no idea how large
the handicap needs to be.

>>
>>> Comparison cost can be approximated probabilistically as:
>>>
>>> cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
>>>
>>> Where cost_op_n is the cost of the comparison function for column N,
>>> and ndistinct(col_1_to_n) is an estimation of the number of distinct
>>> values for columns prior to N in the sort order.
>>>
>>> You can approximate ndistinct as the product of ndistinct of previous
>>> columns, or use extended statistics when available.
>>>
>>
>> Sure. The trouble is this also assumes uniform distribution, which is
>> not always the case.
>
> Well, (1.0 / ndistinct) = p(left == right).
>
> If you can compute a better p(left == right) with an MCV, you can get
> a better estimate. If you have an MCV. But that'd be a bit expensive,
> and I'm not sure it's all that relevant.
>
> To what degree does improving this produce better plans? As long as
> average expected cost is relatively well estimated (as in, one sort
> order relative to another sort order), it won't affect the decision.
>

I'd bet I can construct corner-case examples with vastly different
behavior depending on column data distributions, so it's not entirely
insignificant. How important it is in practice, I don't know.
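
For example (with made-up numbers): take a column with ndistinct = 100
where a single value covers 90% of the rows. The plain uniform estimate
gives p(left == right) = 1/ndistinct = 0.01, while the MCV-based estimate
is dominated by the common value:

#include <stdio.h>

int
main(void)
{
    /* uniform assumption: every value equally likely */
    double  p_uniform = 1.0 / 100.0;

    /* one value at 0.9, the other 99 values share the remaining 0.1 */
    double  p_skewed = 0.9 * 0.9 + 99 * (0.1 / 99) * (0.1 / 99);

    printf("uniform: %.4f  skewed: %.4f\n", p_uniform, p_skewed);
    return 0;
}

That's 0.01 vs. ~0.81 - almost two orders of magnitude - so nearly every
comparison falls through to the next column and the "cheap" first column
does very little for us. How common such distributions are in practice is
a different question.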

> But if needed, the MCV can be used for this.
>
> So, in essence, you need to account for:
>
> - restrictions on that column that constrain the domain
> - distribution skew (MCV, nulls)
> - ndistinct
>
> To compute p(left == right).
>
> Maybe something as simple as the following?
>
> p_special = sum(mcv_i * mcv_i) + null_frac * null_frac
> p_other = (1 - p_special) * (1 - p_special) / ndistinct(restr)
>
>>> I think that should give a good enough approximation of
>>> ndistinct-enriched sort costs that considers operator cost
>>> appropriately. That operator cost is basically an average and can be
>>> used as a constant, so it's just a matter of passing the right
>>> comparison_cost to cost_sort.
>>>
>>> Even if ndistinct estimates are off, the fact that operator cost is
>>> involved should be a good enough tool for the user should the
>>> planner pick the wrong path - it's only a matter of boosting operator
>>> cost to let the planner know that sorting that way is expensive.
>>>
>>
>> There are far too many "shoulds" in these two paragraphs.
>
> Aren't all planner decisions riddled with "should"s?
>
> Anyway, my point is that, since the user can control operator cost,
> and operator cost is involved in the computation, and the ordering
> given by the user is explicitly tested for and given a slight
> advantage, it should (will?) be easy for the user to overcome any poor
> planner decisions by giving the planner the relevant information.
>

I really don't want to force users to tweak operator cost in order to
tune queries. It's about an order of magnitude less convenient than a
GUC, for various reasons.

>>> Prioritization of the user-provided order can be as simple as giving
>>> that comparison_cost a small handicap.
>>
>> I see no point in doing that, and I don't recall a single place in the
>> planner where we do that. If the user specified ORDER BY, we'll slap an
>> explicit Sort on top when needed (which acts as the handicap, but in a
>> clear way). Otherwise we don't do such things - it'd be just plain
>> confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data
>> types, ndistinct etc. but unexpectedly different costs). Also, what
>> would be a good value for the handicap?
>
> As I said, I'm not talking about explicit order by, but about the case
> where the user knows which group by order is the most efficient
> somehow.
>
> I'm proposing the planner shouldn't override the user without clear
> evidence that the other plan is actually better - better above the
> noise. That's what I mean by handicap.
>

OK, gotcha. But I'm really unsure how large the handicap should be.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
