From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date: | 2018-03-29 13:36:09 |
Message-ID: | 7d12b5d1-ac77-67e3-c67e-7fcbd3753e9c@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 03/29/2018 02:27 AM, Dean Rasheed wrote:
> On 28 March 2018 at 15:50, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> After thinking about this a bit more, I'm not sure if updating the info
>> based on recursive calls makes sense. The fullmatch flag was supposed to
>> answer a simple question - can there be just a single matching item?
>>
>> If there are equality conditions on all columns, there can be just a
>> single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
>> then we don't need to inspect the non-MCV part.
>>
>> But handling this in recursive manner breaks this entirely, because with
>> something like
>>
>> (a=1) AND (b=1 OR b=2)
>>
>> you suddenly can have multiple matching items. Which makes the fullmatch
>> flag somewhat useless.
>>
>> So I think we should be looking at top-level equality clauses only, just
>> like number_of_groups() does.
>>
>
> I'm not quite sure what you mean by that, but it sounds a bit limiting
> in terms of the kinds of user queries that would be supported.
>
Let me explain. The question is "Can there be just a single combination
of values matching the conditions?" which (if true) allows us to produce
better estimates. If we found a match in the MCV, we don't need to look
at the non-MCV part. If not found in the MCV, we can compute an average
selectivity as 1/ndistinct (possibly using the ndistinct coefficients).
If we can't deduce the existence of a single possible match, we have to
compute an estimate in a more generic way.
With (a=1 AND b=1) and stats on (a,b) there's just a single possible
match (1,1), so that's fine. But it does not work once we start looking
for equalities nested deeper - for example (a=1 AND (b=1 OR b=2)) can be
translated as ((a=1 AND b=1) OR (a=1 AND b=2)) so technically there's an
equality on each column, but there are two possible matches (1,1) and
(1,2). So the optimization does not work.
Does that clarify what I meant?
Although, perhaps we could improve this by deducing number of possible
matches and then track matching items in the MCV list. But that seems
quite a bit harder.
(Of course, we need to consider the non-equality clauses in both cases,
the WIP patch does not do that yet.)
>
>> I think we can remove the fullmatch flag from mcv_update_bitmap
>> entirely. All we need to know is the presence of equality clauses and
>> whether there was a match in MCV (which we know from s1 > 0.0).
>>
>
> I agree with removing the fullmatch flag, but I don't think we
> actually need to know about the presence of equality clauses:
>
> The way that mcv_update_bitmap() recursively computes the set of
> matching MCVs seems to be correct. That gives us a value (call it
> mcv_matchsel) for the proportion of the table's rows that are in the
> MCV list and satisfy the clauses in stat_clauses.
>
Sure, but the extra bit of information allows us to (a) ignore the
non-MCV part and (b) apply the 1/ndistinct estimate.
> We can also estimate that there are (1-mcv_totalsel)*N rows that are
> not in the MCV list, for which the MCV stats therefore tell us
> nothing. The best way to estimate those rows would seem to be to use
> the logic from the guts of clauselist_selectivity(), without
> consulting any extended MCV stats (but still consulting other extended
> stats, I think). Doing that would return a selectivity value (call it
> nonmcv_sel) for those remaining rows. Then a reasonable estimate for
> the overall selectivity would seem to be
>
> mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel
>
> and there would be no need for mcv_update_bitmap() to track eqmatches
> or return fullmatch, and it wouldn't actually matter whether or not we
> had equality clauses or if all the MCV columns were used.
>
Right, although I'm not sure about fallback to clauselist_selectivity()
which kinda throws away the statistical dependency.
That's why I think we should use 1/ndistinct for equality clauses, and
then perhaps leverage the MCV for non-equality clauses somehow.
It just occurred we can apply the 1/ndistinct estimate for equalities
even when we it's not a 'fullmatch'.
So what I propose is roughly this
1) compute selectivity "mcv_sel" using MCV
2) see if there can be just a single match, and (mcv_sel > 0) - if yes,
we're done and we don't need to look at non-MCV part
3) split the clauses into top-level equality clauses and the rest
4) estimate "equal_sel" for equality clauses using 1/ndistinct
5) estimate the "inequal_sel" for remaining clauses using MCV (assumes
the selectivity will be the same on non-MCV part)
6) total selectivity is
mcv_sel + (1 - mcv_totalsel) * equal_sel * inequal_sel
We may need to fall back to clauselist_selectivity() in some cases, of
course, but I think we should leverage the MCV as much as possible.
Another thing is that some of this will change once the histograms are
considered, which helps with estimating the non-MCV part.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2018-03-29 13:40:03 | Re: Function to track shmem reinit time |
Previous Message | Teodor Sigaev | 2018-03-29 13:35:39 | Re: Cast jsonb to numeric, int, float, bool |