Re: POC: GROUP BY optimization

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2019-05-24 22:57:25
Message-ID: 20190524225725.embuha33qvc5avz3@development
Lists: pgsql-hackers

On Fri, May 24, 2019 at 02:10:48PM +0200, Dmitry Dolgov wrote:
>> On Fri, May 3, 2019 at 11:55 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> I don't recall the exact details of the discussion, since most of it
>> happened almost a year ago, but the main concern about the original
>> costing approach is that it very much assumes uniform distribution.
>>
>> For example if you have N tuples with M groups (which are not computed
>> using estimate_num_groups IIRC, and we can hardly do much better), then
>> the patch assumed each group is ~N/M rows and used that for computing
>> the number of comparisons for a particular sequence of ORDER BY clauses.
>>
>> But that may result in pretty significant errors in cases with a couple
>> of very large groups.
>
>Yes, you're right, the current version of the patch assumes uniform
>distribution of values between these M groups. After some thinking I've got
>an idea that maybe it's better not to try to find out what would be
>acceptable for both uniform and non-uniform distributions, but simply not
>reorder at all if there are any significant deviations from what seems to be
>the "best case", namely when values are distributed among groups relatively
>uniformly and there are no doubts about how complicated it is to compare them.
>
>Saying that, it's possible on top of the current implementation to check:
>
>* the dependency coefficient between columns (if I understand correctly, a
> non-uniform distribution of values between the groups, i.e. a few large
> groups, should be visible in extended statistics as a significant
> dependency)
>
>* the largest/smallest group in the MCV list doesn't differ too much from
> the expected "average" group
>
>* the average width and comparison procost are similar
>
>If everything fits (which I hope would be the case most of the time), apply
>the reordering; otherwise use whatever order was specified (which leaves
>room for manual reordering in relatively rare situations). Does that make
>sense?
>

I think those are interesting sources of information. I don't know if they
will be sufficient, but I think it's worth exploring.
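
Roughly, and with made-up names and threshold constants just to have
something concrete, I read your proposal as a gating check along these
lines (this is only a sketch, not actual patch code):

    /*
     * Hypothetical sketch of the proposed gating checks - the struct, the
     * names and the threshold constants are invented for illustration.
     */
    typedef struct GroupingStats
    {
        double      dependency_degree;  /* functional dependency between columns */
        double      largest_mcv_group;  /* largest group implied by the MCV list */
        double      avg_group_size;     /* ntuples / estimated number of groups */
        double      width_ratio;        /* relative difference in average width */
        double      procost_ratio;      /* relative difference in comparison procost */
    } GroupingStats;

    static bool
    grouping_reorder_allowed(const GroupingStats *stats)
    {
        if (stats->dependency_degree > 0.5)         /* columns too correlated? */
            return false;
        if (stats->largest_mcv_group > 2.0 * stats->avg_group_size)
            return false;                           /* distribution too skewed? */
        if (stats->width_ratio > 1.5 || stats->procost_ratio > 1.5)
            return false;                           /* comparisons not comparable */
        return true;                                /* looks like the "best case" */
    }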

One of the issues with this approach, however, will be selecting the
threshold values. That is, how do you decide whether a given functional
dependency is "too strong", or whether the largest/smallest MCV item
differs too much from the average one?

If you pick a particular threshold value, there'll be a sudden behavior
change at some point, and that's very annoying. For example, you might
pick 0.5 as a threshold for "strong" functional dependencies. There's
little reason why 0.4999 should be "weak" and 0.5001 should be "strong".

This may lead to sudden behavior changes after ANALYZE, for example.

So I think we need to find a way to make this part of the costing model,
which is how we deal with such cases elsewhere.
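
To make that a bit more concrete (again with made-up names and a made-up
blending formula, just to show the idea), instead of a hard cutoff the same
information could feed directly into the estimated group size:

    /*
     * Hypothetical sketch: fold the dependency degree into the group-size
     * estimate instead of using it as a hard cutoff. The blending formula
     * is invented for illustration - the point is that the estimate
     * degrades smoothly as the statistics get worse, with no sudden flip
     * at some threshold like 0.5.
     */
    static double
    est_group_size_blended(double ntuples, double ngroups,
                           double largest_mcv_group, double dependency_degree)
    {
        double      avg = ntuples / ngroups;    /* uniform-distribution estimate */

        /* interpolate between the average group and the largest MCV group */
        return avg + dependency_degree * (largest_mcv_group - avg);
    }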

>> But what I think we could do is use the largest possible group instead of
>> the average one. So for example when estimating the number of comparisons
>> for columns (c1, ..., cN), you could look at the MCV lists for these
>> columns and compute
>>
>> m(i) = Max(group size in the MCV list for the i-th column)
>>
>> and then use Min(m(1), ..., m(N)) when estimating the number of
>> comparisons.
>
>I see the idea, but I'm a bit confused about how to get the largest group from
>an MCV list. I mean, there is a list of most common values per column with
>frequencies, and similar stuff for multi-column statistics, but how do we get
>the size of those groups?

You can just multiply the frequency by the number of rows in the table;
that gives you the size of the group. Or rather an estimate of it, because
there might be a filter condition involved.
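
So, schematically (a hypothetical helper, just to spell it out):

    /*
     * For each column considered, scale the largest MCV frequency by the
     * row count to get the largest group size m(i), then take
     * Min(m(1), ..., m(N)) as suggested above.
     */
    static double
    largest_group_estimate(const double *max_mcv_freq, int ncols, double ntuples)
    {
        double      result = ntuples;   /* no MCV info: assume one big group */

        for (int i = 0; i < ncols; i++)
        {
            double      m_i = max_mcv_freq[i] * ntuples;

            if (m_i < result)
                result = m_i;
        }

        return result;
    }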

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
