Quick Links

Re: Additional improvements to extended statistics

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Additional improvements to extended statistics
Date:	2020-12-02 16:34:04
Message-ID:	01a57d32-e442-b57f-6154-fd42129d6307@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 12/2/20 4:51 PM, Dean Rasheed wrote:
> On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> I wonder how much of the comment before clauselist_selectivity should
>> move to clauselist_selectivity_ext - it does talk about range clauses
>> and so on, but clauselist_selectivity does not really deal with that.
>> But maybe that's just an implementation detail and it's better to keep
>> the comment the way it is.
>
> I think it's better to keep it the way it is, so that the entirety of
> what clauselist_selectivity() does (via clauselist_selectivity_ext())
> can be read in one place, but I have added separate comments for the
> new "_ext" functions to explain how they differ. That matches similar
> examples elsewhere.
>

>
>> I noticed this outdated comment:
>>
>> /* Always compute the selectivity using clause_selectivity */
>> s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
>
> Updated.
>
>
>> Also, the comment at clauselist_selectivity_or seems to not follow the
>> usual pattern, which I think is
>>
>> /*
>> * function name
>> * short one-sentence description
>> *
>> * ... longer description ...
>> */
>
> Hmm, it seems OK to me. The first part is basically copied from
> clauselist_selectivity(). The "longer description" doesn't really have
> much more to say because it's much simpler than
> clauselist_selectivity(), but it seems neater to keep the two roughly
> in sync.
>

I see. In that case it's OK, I guess.

>
> I've been hacking on this a bit more and attached is an updated
> (hopefully final) version with some other comment improvements and
> also a couple of other tweaks:
>
> The previous version had duplicated code blocks that implemented the
> same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
> other_sel and total_sel, which was quite messy. So I refactored that
> into a new function mcv_combine_selectivities(). About half the
> comments from statext_mcv_clauselist_selectivity() then move over to
> mcv_combine_selectivities(). I also updated the comments for
> mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
> explain how their outputs are expected to be used by
> mcv_combine_selectivities(). That hopefully makes for a clean
> separation of concerns, and makes it easier to tweak the way MCV stats
> are applied on top of simple stats, if someone thinks of a better
> approach in the future.
>
> In the previous version, for an ORed list of clauses, the MCV
> correction was only applied to the overlaps between clauses. That's OK
> as long as each clause only refers to a single column, since the
> per-column statistics ought to be the best way to estimate each
> individual clause in that case. However, if the individual clauses
> refer to more than one column, I think the MCV correction should be
> applied to each individual clause as well as to the overlaps. That
> turns out to be pretty straightforward, since we're already doing all
> the hard work of computing the match bitmap for each clause. The sort
> of queries I had in mind were things like this:
>
> WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
>
> I added a new test case along those lines and the new estimates are
> much better than they are without this patch, but not for the reason I
> thought --- the old code consistently over-estimated queries like that
> because it actually applied the MCV correction twice (once while
> processing each AND list, via clause_selectivity(), called from
> clauselist_selectivity_simple(), and once for the top-level OR clause,
> contained in a single-element implicitly-ANDed list). The way the new
> code is structured avoids any kind of double application of extended
> stats, producing a much better estimate, which is good.
>

Nice.

> However, the new code doesn't apply the extended stats directly using
> clauselist_selectivity_or() for this kind of query because there are
> no RestrictInfos for the nested AND clauses, so
> find_single_rel_for_clauses() (and similarly
> statext_is_compatible_clause()) regards those clauses as not
> compatible with extended stats. So what ends up happening is that
> extended stats are used only when we descend down to the two AND
> clauses, and their results are combined using the original "s1 + s2 -
> s1 * s2" formula. That actually works OK in this case, because there
> is no overlap between the two AND clauses, but it wouldn't work so
> well if there was.
>
> I'm pretty sure that can be fixed by teaching
> find_single_rel_for_clauses() and statext_is_compatible_clause() to
> handle BoolExpr clauses, looking for RestrictInfos underneath them,
> but I think that should be left for a follow-in patch. I have left a
> regression test in place, whose estimates ought to be improved by such
> a fix.
>

Yeah, I agree with leaving this for a separate patch. We can't do
everything at the same time.

> The upshot of all that is that the new code that applies the MCV
> correction to the individual clauses in an ORed list doesn't help with
> queries like the one above at the moment, and it's not obvious whether
> it is currently reachable, but I think it's worth leaving in because
> it seems more principled, and makes that code more future-proof. I
> also think it's neater because now the signature of
> mcv_clause_selectivity_or() is more natural --- it's primary return
> value is now the clause's MCV selectivity, as suggested by the
> function's name, rather than the overlap selectivity that the previous
> version was returning. Also, after your "Var Op Var" patch is applied,
> I think it would be possible to construct queries that would benefit
> from this, so it would be good to get that committed too.
>
> Barring any further comments, I'll push this sometime soon.
>

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Additional improvements to extended statistics at 2020-12-02 15:51:29 from Dean Rasheed

Responses

Re: Additional improvements to extended statistics at 2020-12-03 12:56:34 from Dean Rasheed

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2020-12-02 16:52:54	Re: [HACKERS] [PATCH] Generic type subscripting
Previous Message	Chapman Flack	2020-12-02 16:32:04	Re: proposal: unescape_text function