Quick Links

Re: Additional improvements to extended statistics

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Additional improvements to extended statistics
Date:	2020-11-29 21:02:22
Message-ID:	ec21ebff-67ff-3d80-462d-131bdde91b58@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 11/29/20 3:57 PM, Dean Rasheed wrote:
>> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
>>> AFAICS we use "_internal" for similar functions.
>>>
>
> I have been thinking about this some more. The one part of this that I
> still wasn't happy with was the way that base frequencies were used to
> compute the selectivity correction to apply. As noted in [1], using
> base frequencies in this way isn't really appropriate for clauses
> combined using "OR". The reason is that an item's base frequency is
> computed as the product of the per-column selectivities, so that (freq
> - base_freq) is the right correction to apply for a set of clauses
> combined with "AND", but it doesn't really work properly for clauses
> combined with "OR". This is why a number of the estimates in the
> regression tests end up being significant over-estimates.
>
> I speculated in [1] that we might fix that by tracking which columns
> of the match bitmap actually matched the clauses being estimated, and
> then only use those base frequencies. Unfortunately that would also
> mean changing the format of the stats that we store, and so would be a
> rather invasive change.
>
> It occurred to me though, that there is another, much more
> straightforward way to do it. We can rewrite the "OR" clauses, and
> turn them into "AND" clauses using the fact that
>
> P(A OR B) = P(A) + P(B) - P(A AND B)
>
> and then use the multivariate stats to estimate the P(A AND B) part in
> the usual way.
>

OK, that seems quite reasonable.

> Attached is the resulting patch doing it that way. The main change is
> in the way that statext_mcv_clauselist_selectivity() works, combined
> with a new function mcv_clause_selectivity_or() that does the
> necessary MCV bitmap manipulations.
>
> Doing it this way also means that clausesel.c doesn't need to export
> clauselist_selectivity_or(), and the new set of exported functions
> seem a bit neater now.
>

Nice. I agree this looks way better than the version I hacked together.

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I noticed this outdated comment:

/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
* function name
* short one-sentence description
*
* ... longer description ...
*/

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

> A handful of regression test results change, and in all cases except
> one the new estimates are much better. One estimate is made worse, but
> in that case we only have 2 sets of partial stats:
>
> SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0
>
> with stats on (a,b) and (c,d) so it's not surprising that combining (a
> = 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
> old MV stats estimate was more down to chance in this case.
>

Yeah, that's quite possible - we're multiplying two estimates, but
there's a clear correlation. So it was mostly luck we had over-estimated
the clauses before, which gave us higher product and thus accidentally
better overall estimate.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Additional improvements to extended statistics at 2020-11-29 14:57:42 from Dean Rasheed

Responses

Re: Additional improvements to extended statistics at 2020-12-01 08:15:15 from Dean Rasheed
Re: Additional improvements to extended statistics at 2020-12-02 15:51:29 from Dean Rasheed

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2020-11-29 21:10:03	Re: Why does create_gather_merge_plan need make_sort?
Previous Message	Paul A Jungwirth	2020-11-29 20:53:40	Re: range_agg