Re: Improve join selectivity estimation using extended statistics

From: Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Subject: Re: Improve join selectivity estimation using extended statistics
Date: 2021-07-19 10:52:47
Message-ID: CALtqXTdiZ6Aum8xo8vJPcs+oXx5wfNMox3NBtW4yfYU8PGaT1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 15, 2021 at 8:42 PM Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

>
>
> On 11.03.2021 03:47, Tomas Vondra wrote:
> > Hi Konstantin,
> >
> > Thanks for working on this! Using extended statistics to improve join
> > cardinality estimates was definitely on my radar, and this patch seems
> > like a good start.
> >
> > I had two basic ideas about how we might improve join estimates:
> >
> > (a) use per-table extended statistics to estimate join conditions
> >
> > (b) invent multi-table extended statistics (requires inventing how to
> > sample the tables in a coordinated way, etc.)
> >
> > This patch aims to do (a) which is perfectly reasonable - I think we can
> > achieve significant improvements this way. I have some ideas about (b),
> > but it seems harder and for a separate thread/patch.
> >
> >
> > The patch includes some *very* interesting ideas, but I think it's does
> > them too late and at the wrong level of abstraction. I mean that:
> >
> > 1) I don't think the code in clausesel.c should deal with extended
> > statistics directly - it requires far too much knowledge about different
> > types of extended stats, what clauses are supported by them, etc.
> > Allowing stats on expressions will make this even worse.
> >
> > Better do that in extended_stats.c, like statext_clauselist_selectivity.
> >
> > 2) in clauselist_selectivity_ext, functional dependencies are applied in
> > the part that processes remaining clauses, not estimated using extended
> > statistics. That seems a bit confusing, and I suspect it may lead to
> > issues - for example, it only processes the clauses incrementally, in a
> > particular order. That probably affects the result, because it affects
> > which functional dependencies we can apply.
> >
> > In the example query that's not an issue, because it only has two Vars,
> > so it either can't apply anything (with one Var) or it can apply
> > everything (with two Vars). But with 3 or more Vars the order would
> > certainly matter, so it's problematic.
> >
> >
> > Moreover, it seems a bit strange that this considers dependencies only
> > on the inner relation. Can't that lead to issues with different join
> > orders producing different cardinality estimates?
> >
> >
> > I think a better approach would be to either modify the existing block
> > dealing with extended stats for a single relation to also handle join
> > conditions. Or perhaps we should invent a separate block, dealing with
> > *pairs* of relations? And it should deal with *all* join clauses for
> > that pair of relations at once, not one by one.
> >
> > As for the exact implementation, I'd imagine we call overall logic to be
> > something like (for clauses on two joined relations):
> >
> > - pick a subset of clauses with the same type of extended statistics on
> > both sides (MCV, ndistinct, ...), repeat until we can't apply more
> > statistics
> >
> > - estimate remaining clauses either using functional dependencies or in
> > the regular (old) way
> >
> >
> > As for how to use other types of extended statistics, I think eqjoinsel
> > could serve as an inspiration. We should look for an MCV list and
> > ndistinct stats on both sides of the join (possibly on some subset of
> > clauses), and then do the same thing eqjoinsel does, just with multiple
> > columns.
> >
> > Note: I'm not sure what to do when we find the stats only on one side.
> > Perhaps we should assume the other side does not have correlations and
> > use per-column statistics (seems reasonable), or maybe just not apply
> > anything (seems a bit too harsh).
> >
> > Anyway, if there are some non-estimated clauses, we could try applying
> > functional dependencies similarly to what this patch does. It's also
> > consistent with statext_clauselist_selectivity - that also tries to
> > apply MCV lists first, and only then we try functional dependencies.
> >
> >
> > BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
> > selectivity function for restriction (non-join) clauses, so maybe we
> > should be looking at oprjoin when dealing with joins? Not sure.
> >
> >
> > One bit that I find *very* interesting is the calc_joinrel_size_estimate
> > part, with this comment:
> >
> > /*
> > * Try to take in account functional dependencies between attributes
> > * of clauses pushed-down to joined relations and retstrictlist
> > * clause. Right now we consider only case of restrictlist consists of
> > * one clause.
> > */
> >
> > If I understand the comment and the code after it, it essentially tries
> > to apply extended statistics from both the join clauses and filters at
> > the relation level. That is, with a query like
> >
> > SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
> >
> > we would be looking at statistics on t1(a,b), because we're interested
> > in estimating conditional probability distribution
> >
> > P(t1.a = a? | t1.b = 10)
> >
> > I think that's extremely interesting and powerful, because it allows us
> > to "restrict" the multi-column MCV lists, we could probably estimate
> > number of distinct "a" values in rows with "b=10" like:
> >
> > ndistinct(a,b) / ndistinct(b)
> >
> > and do various interesting stuff like this.
> >
> > That will require some improvements to the extended statistics code (to
> > allow passing a list of conditions), but that's quite doable. I think
> > the code actually did something like that originally ;-)
> >
> >
> > Obviously, none of this is achievable for PG14, as we're in the middle
> > of the last CF. But if you're interested in working on this for PG15,
> > I'd love to cooperate on that.
> >
> >
> > regards
> >
> Hi Tomas,
> Thank you for review of my patch.
> My primary attention was to implement some kid of adaptive query
> optimization based online_analyze extension and building extended
> statistic on demand.
> I have change clausesel.c because right now extended statistic is not
> used for join selectivity estimation and manual or automatic adding such
> statistic can help to
> choose more efficient plan for queries with joins.
> I agree wit you that it can be done in better way, handling more use cases.
> I will be glad to cooperate with you in improving join selectivity
> estimation using extended statistic.
>
>
>
> The patch does not compile, and needs your attention.

https://cirrus-ci.com/task/6397726985289728

clausesel.c:74:28: error: too few arguments to function
‘choose_best_statistics’
StatisticExtInfo *stat = choose_best_statistics(rel->statlist,
STATS_EXT_DEPENDENCIES,
^~~~~~~~~~~~~~~~~~~~~~
In file included from clausesel.c:24:
../../../../src/include/statistics/statistics.h:123:26: note: declared here
exter

I am changing the status to "Waiting on Author".

--
Ibrar Ahmed

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-07-19 11:00:47 Re: row filtering for logical replication
Previous Message Ibrar Ahmed 2021-07-19 10:34:56 Re: shared-memory based stats collector