From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Additional improvements to extended statistics |
Date: | 2020-11-29 21:02:22 |
Message-ID: | ec21ebff-67ff-3d80-462d-131bdde91b58@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 11/29/20 3:57 PM, Dean Rasheed wrote:
>> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
>>> AFAICS we use "_internal" for similar functions.
>>>
>
> I have been thinking about this some more. The one part of this that I
> still wasn't happy with was the way that base frequencies were used to
> compute the selectivity correction to apply. As noted in [1], using
> base frequencies in this way isn't really appropriate for clauses
> combined using "OR". The reason is that an item's base frequency is
> computed as the product of the per-column selectivities, so that (freq
> - base_freq) is the right correction to apply for a set of clauses
> combined with "AND", but it doesn't really work properly for clauses
> combined with "OR". This is why a number of the estimates in the
> regression tests end up being significant over-estimates.
>
> I speculated in [1] that we might fix that by tracking which columns
> of the match bitmap actually matched the clauses being estimated, and
> then only use those base frequencies. Unfortunately that would also
> mean changing the format of the stats that we store, and so would be a
> rather invasive change.
>
> It occurred to me though, that there is another, much more
> straightforward way to do it. We can rewrite the "OR" clauses, and
> turn them into "AND" clauses using the fact that
>
> P(A OR B) = P(A) + P(B) - P(A AND B)
>
> and then use the multivariate stats to estimate the P(A AND B) part in
> the usual way.
>
OK, that seems quite reasonable.
> Attached is the resulting patch doing it that way. The main change is
> in the way that statext_mcv_clauselist_selectivity() works, combined
> with a new function mcv_clause_selectivity_or() that does the
> necessary MCV bitmap manipulations.
>
> Doing it this way also means that clausesel.c doesn't need to export
> clauselist_selectivity_or(), and the new set of exported functions
> seem a bit neater now.
>
Nice. I agree this looks way better than the version I hacked together.
I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.
I noticed this outdated comment:
/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is
/*
* function name
* short one-sentence description
*
* ... longer description ...
*/
Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?
> A handful of regression test results change, and in all cases except
> one the new estimates are much better. One estimate is made worse, but
> in that case we only have 2 sets of partial stats:
>
> SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0
>
> with stats on (a,b) and (c,d) so it's not surprising that combining (a
> = 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
> old MV stats estimate was more down to chance in this case.
>
Yeah, that's quite possible - we're multiplying two estimates, but
there's a clear correlation. So it was mostly luck we had over-estimated
the clauses before, which gave us higher product and thus accidentally
better overall estimate.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2020-11-29 21:10:03 | Re: Why does create_gather_merge_plan need make_sort? |
Previous Message | Paul A Jungwirth | 2020-11-29 20:53:40 | Re: range_agg |