Quick Links

Re: Proper usage of ndistinct vs. dependencies extended statistics

From:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To:	Paul Martinez <hellopfm(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Proper usage of ndistinct vs. dependencies extended statistics
Date:	2019-04-11 04:26:11
Message-ID:	CAKJS1f8159XZ6xJgxSQq98jRdVRF2JWFwiWhJys6OQJn3r_DcA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, 11 Apr 2019 at 11:53, Paul Martinez <hellopfm(at)gmail(dot)com> wrote:
>
> I have some questions about the different types of extended statistics
> that were introduced in Postgres 10.
> - Which types of queries are each statistic type supposed to improve?

Multivariate ndistinct stats are aimed to improve distinct estimates
over groups of columns. These can help in cases like GROUP BY a,b,
SELECT DISTINCT a,b, SELECT a,b FROM x UNION SELECT a,b FROM y; They
also help in determining the number of times an index will be
rescanned in cases like nested loops with a parameterised inner path.

I see multivariate ndistinct estimates are not used for normal
selectivity estimates for unknown values. e.g PREPARE q1 (int, int)
AS SELECT * FROM t1 WHERE a = $1 and b = $2; still assumes a and b are
independent even when ndistinct stats exist on the two columns.

There are a few other usages too. See calls of estimate_num_groups()

dependency stats just handle WHERE clauses (or more accurately,
clauses containing a reference to a single relation. These only
handle equality OpExprs. e.g "a = 10 and y = 3", not "a < 6 and y =
3". Further stat types (most common values) added in PG12 aim to
allow inequality operators too.

> - When should one type of statistic be used over the other? Should they
> both always be used?

If they both always should be always used then we'd likely not have
bothered making the types optional. Both ndistinct and dependency
stats are fairly cheap to calculate and store, so it might not be too
big an issue adding both types if you're not sure. With these two
types there's not really any choice for the planner to decide to use
one or the other, it just makes use of the ones it can use for the
given situation. That won't be the case as more stats types get
added. In PG12, for example, we had to choose of MCV stats should be
applied before dependencies stats. That might be a no-brainer, but
perhaps the future there will be stats types where the order to apply
them is not so clear, although in those cases it might be questionable
why you'd want to define more than one type of stats on the same set
of columns.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Proper usage of ndistinct vs. dependencies extended statistics at 2019-04-10 23:52:27 from Paul Martinez

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2019-04-11 04:26:38	Re: Cleanup/remove/update references to OID column
Previous Message	Michael Paquier	2019-04-11 04:22:28	Re: block-level incremental backup