Re: Improve join selectivity estimation using extended statistics

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Subject: Re: Improve join selectivity estimation using extended statistics
Date: 2021-03-11 00:47:14
Message-ID: 2a7a4c58-d9b4-5529-945a-0dcba6f6ef22@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Konstantin,

Thanks for working on this! Using extended statistics to improve join
cardinality estimates was definitely on my radar, and this patch seems
like a good start.

I had two basic ideas about how we might improve join estimates:

(a) use per-table extended statistics to estimate join conditions

(b) invent multi-table extended statistics (requires inventing how to
sample the tables in a coordinated way, etc.)

This patch aims to do (a) which is perfectly reasonable - I think we can
achieve significant improvements this way. I have some ideas about (b),
but it seems harder and for a separate thread/patch.

The patch includes some *very* interesting ideas, but I think it's does
them too late and at the wrong level of abstraction. I mean that:

1) I don't think the code in clausesel.c should deal with extended
statistics directly - it requires far too much knowledge about different
types of extended stats, what clauses are supported by them, etc.
Allowing stats on expressions will make this even worse.

Better do that in extended_stats.c, like statext_clauselist_selectivity.

2) in clauselist_selectivity_ext, functional dependencies are applied in
the part that processes remaining clauses, not estimated using extended
statistics. That seems a bit confusing, and I suspect it may lead to
issues - for example, it only processes the clauses incrementally, in a
particular order. That probably affects the result, because it affects
which functional dependencies we can apply.

In the example query that's not an issue, because it only has two Vars,
so it either can't apply anything (with one Var) or it can apply
everything (with two Vars). But with 3 or more Vars the order would
certainly matter, so it's problematic.

Moreover, it seems a bit strange that this considers dependencies only
on the inner relation. Can't that lead to issues with different join
orders producing different cardinality estimates?

I think a better approach would be to either modify the existing block
dealing with extended stats for a single relation to also handle join
conditions. Or perhaps we should invent a separate block, dealing with
*pairs* of relations? And it should deal with *all* join clauses for
that pair of relations at once, not one by one.

As for the exact implementation, I'd imagine we call overall logic to be
something like (for clauses on two joined relations):

- pick a subset of clauses with the same type of extended statistics on
both sides (MCV, ndistinct, ...), repeat until we can't apply more
statistics

- estimate remaining clauses either using functional dependencies or in
the regular (old) way

As for how to use other types of extended statistics, I think eqjoinsel
could serve as an inspiration. We should look for an MCV list and
ndistinct stats on both sides of the join (possibly on some subset of
clauses), and then do the same thing eqjoinsel does, just with multiple
columns.

Note: I'm not sure what to do when we find the stats only on one side.
Perhaps we should assume the other side does not have correlations and
use per-column statistics (seems reasonable), or maybe just not apply
anything (seems a bit too harsh).

Anyway, if there are some non-estimated clauses, we could try applying
functional dependencies similarly to what this patch does. It's also
consistent with statext_clauselist_selectivity - that also tries to
apply MCV lists first, and only then we try functional dependencies.

BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
selectivity function for restriction (non-join) clauses, so maybe we
should be looking at oprjoin when dealing with joins? Not sure.

One bit that I find *very* interesting is the calc_joinrel_size_estimate
part, with this comment:

/*
* Try to take in account functional dependencies between attributes
* of clauses pushed-down to joined relations and retstrictlist
* clause. Right now we consider only case of restrictlist consists of
* one clause.
*/

If I understand the comment and the code after it, it essentially tries
to apply extended statistics from both the join clauses and filters at
the relation level. That is, with a query like

SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10

we would be looking at statistics on t1(a,b), because we're interested
in estimating conditional probability distribution

P(t1.a = a? | t1.b = 10)

I think that's extremely interesting and powerful, because it allows us
to "restrict" the multi-column MCV lists, we could probably estimate
number of distinct "a" values in rows with "b=10" like:

ndistinct(a,b) / ndistinct(b)

and do various interesting stuff like this.

That will require some improvements to the extended statistics code (to
allow passing a list of conditions), but that's quite doable. I think
the code actually did something like that originally ;-)

Obviously, none of this is achievable for PG14, as we're in the middle
of the last CF. But if you're interested in working on this for PG15,
I'd love to cooperate on that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-03-11 00:47:49 Re: shared-memory based stats collector
Previous Message Justin Pryzby 2021-03-11 00:40:54 Re: [HACKERS] Custom compression methods