Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-07-16 20:55:37
Message-ID: 80bc98b2-1c43-1fe8-b316-d447bb68cff2@2ndquadrant.com
Lists: pgsql-hackers

On 07/16/2018 02:54 PM, Dean Rasheed wrote:
> On 16 July 2018 at 13:23, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>> The top-level clauses allow us to make such deductions, with deeper
>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>> and so a match in MCV does not guarantee anything.
>>>
>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>> maybe that's the best that we can do in the absence of any other
>>> stats.
>>>
>> Hmmm, is that actually true? Let's consider a simple example, with two
>> columns, each with just 2 values, and a "perfect" MCV list:
>>
>> a | b | frequency
>> -------------------
>> 1 | 1 | 0.5
>> 2 | 2 | 0.5
>>
>> And let's estimate sel(a=1 & b=2).
>
> OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).
>
> What we could do though is also impose an upper bound, based on the
> sum of non-matching MCV frequencies. In this case, the upper bound is
> also 0, so we could actually say the resulting selectivity is 0.
>

Hmmm, it's not very clear to me how we would decide which of these cases
applies, because in most cases we don't have an MCV list covering 100% of
the rows.
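
Just to make the bounds concrete for the partial-coverage case (the
numbers here are made up for illustration, not taken from any example in
this thread): say the MCV list covers 80% of the table, the items
matching the clauses sum to 0.05, and the non-matching items to 0.75.
Computed by hand:

select 0.05       as lower_bound,  -- matches found among the MCV items
       1.0 - 0.75 as upper_bound;  -- matches may also hide in the 20%
                                   -- of rows not covered by the MCV

With 100% coverage the two bounds collapse to the same value (which is
why the example above could say "the selectivity is 0"), but with
partial coverage there's a gap, and it's not obvious which point in that
gap to pick.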

Anyway, I've been thinking about how to modify the code to work the way
you proposed (in a way sufficient for a PoC). But after struggling with
it for a while, it occurred to me it might be useful to do it on paper
first, to verify how it would work on your examples.

So let's use this data

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,50000);
insert into foo select 1,2 from generate_series(1,40000);
insert into foo select 1,x/10 from generate_series(30,250000) g(x);
insert into foo select 2,1 from generate_series(1,30000);
insert into foo select 2,2 from generate_series(1,20000);
insert into foo select 2,x/10 from generate_series(30,500000) g(x);
insert into foo select 3,1 from generate_series(1,10000);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,600000) g(x);
insert into foo select x,x/10 from generate_series(4,750000) g(x);
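
For completeness, with the patch applied the statistics object would be
built roughly like this (the statistics name is arbitrary, and the high
target is only there to get MCV lists close to the "perfectly exact"
assumption below):

set default_statistics_target = 10000;
create statistics foo_stats (mcv) on a, b from foo;
analyze foo;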

Assuming we have perfectly exact statistics, we have these MCV lists
(both univariate and multivariate):

select a, count(*), round(count(*) / 2254937.0, 4) AS frequency
from foo group by a order by 2 desc;

a | count | frequency
--------+--------+-----------
3 | 614998 | 0.2727
2 | 549971 | 0.2439
1 | 339971 | 0.1508
1014 | 1 | 0.0000
57220 | 1 | 0.0000
...

select b, count(*), round(count(*) / 2254937.0, 4) AS frequency
from foo group by b order by 2 desc;

b | count | frequency
--------+-------+-----------
1 | 90010 | 0.0399
2 | 65010 | 0.0288
3 | 31 | 0.0000
7 | 31 | 0.0000
...

select a, b, count(*), round(count(*) / 2254937.0, 4) AS frequency
from foo group by a, b order by 3 desc;

a | b | count | frequency
--------+--------+-------+-----------
1 | 1 | 50000 | 0.0222
1 | 2 | 40000 | 0.0177
2 | 1 | 30000 | 0.0133
2 | 2 | 20000 | 0.0089
3 | 1 | 10000 | 0.0044
3 | 2 | 5000 | 0.0022
2 | 12445 | 10 | 0.0000
...

And let's estimate the two queries with complex clauses, where the
multivariate stats gave 2x overestimates.

SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
-- actual 90000, univariate: 24253, multivariate: 181091
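
(The actual count is easy to verify directly:

select count(*) from foo where a=1 and (b=1 or b=2);  -- 90000
)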

univariate:

sel(a=1) = 0.1508
sel(b=1) = 0.0399
sel(b=2) = 0.0288
sel(b=1 or b=2) = 0.0673
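
(For reference, the planner's generic disjunction rule is
s1 + s2 - s1*s2; a hand computation with the rounded frequencies gives

select 0.0399 + 0.0288 - 0.0399 * 0.0288;  -- ~0.0676

which is close to the 0.0673 used here.)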

multivariate:
sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)

The second multivariate estimate comes from assuming only the first 6
items make it to the multivariate MCV list (covering 6.87% of the data)
and extrapolating the selectivity to the non-MCV data too.

(Notice it's about 2x the actual selectivity, so this extrapolation, due
to not realizing the MCV list already contains all the matches, is pretty
much responsible for the whole over-estimate.)

So, how would the proposed algorithm work? Let's start with "a=1":

sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (nor
is it clear how that would be done, as it's effectively a single clause):

sel(b=1 or b=2) = 0.0673

And we get

total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

From the multivariate MCV we get

mcv_sel = 0.0399

And finally

total_sel = Max(total_sel, mcv_sel) = 0.0399
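
Or, as a single self-contained computation (greatest() playing the role
of the Max above):

select greatest(0.1508 * 0.0673,       -- independence-based estimate
                0.0399) as total_sel;  -- MCV lower bound wins: 0.0399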

Which is great, but I don't see how that actually helped: we still only
have the estimate for the ~7% of rows covered by the MCV list, and not
for the remaining non-MCV part.

I could do the same thing for the second query, but the problem there is
exactly the same - extrapolating from the MCV to the non-MCV part roughly
doubles the estimate.

So unless I'm applying your algorithm incorrectly, this does not seem
like a very promising direction :-(

There may be valuable information we could learn from the univariate
estimates (using a Max() of them as an upper boundary seems reasonable),
but that's still quite crude. And it will only ever work with simple
top-level clauses. Once the clauses get more complicated, it seems rather
tricky - presumably multivariate stats would only be used for correlated
columns, so trying to deduce something from univariate estimates of
complex clauses on such columns seems somewhat suspicious.
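
To be explicit about what that clamping would look like (a hand
computation with the numbers from the first query, using the Max() of
the univariate selectivities as the upper boundary):

select least(greatest(0.1508 * 0.0673,   -- independence estimate
                      0.0399),           -- MCV lower bound
             greatest(0.1508, 0.0673))   -- univariate upper boundary
       as clamped_sel;                   -- still 0.0399

In this example the upper boundary doesn't kick in at all, which is what
I mean by "quite crude".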

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
