Quick Links

Re: PoC/WIP: Extended statistics on expressions

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC/WIP: Extended statistics on expressions
Date:	2021-03-24 06:24:47
Message-ID:	20210324062446.GA28335@telsasoft.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'

It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me. For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".

> + Name of schema containing table

I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table. Say "Name of the schema which contains the table" ?

> + Name of table

Say "name of table on which the extended statistics are defined"

> + Name of extended statistics

"Name of the extended statistic object"

> + Owner of the extended statistics

..object

> + Expression the extended statistics is defined on

I think it should say "the extended statistic", or "the extended statistics
object". Maybe "..on which the extended statistic is defined"

> + of random access to the disk. (This expression is null if the expression
> + data type does not have a <literal><</literal> operator.)

expression's data type

> + much-too-small row count estimate in the first two queries. Moreover, the

maybe say "dramatically underestimates the rowcount"

> + planner has no information about relationship between the expressions, so it

the relationship

> + assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> + conditions are independent, and multiplies their selectivities together to
> + arrive at a much-too-high group count estimate in the aggregate query.

severe overestimate ?

> + This is further exacerbated by the lack of accurate statistics for the
> + expressions, forcing the planner to use default ndistinct estimate for the

use *a default

> + expression derived from ndistinct for the column. With such statistics, the
> + planner recognizes that the conditions are correlated and arrives at much
> + more accurate estimates.

are correlated comma

> + if (type->lt_opr == InvalidOid)

These could be !OidIsValid

> + * expressions. It's either expensive or very easy to defeat for
> + * determined used, and there's no risk if we allow such statistics (the
> + * statistics is useless, but harmless).

I think it's meant to say "for a determined user" ?

> + * If there are no simply-referenced columns, give the statistics an auto
> + * dependency on the whole table. In most cases, this will be redundant,
> + * but it might not be if the statistics expressions contain no Vars
> + * (which might seem strange but possible).
> + */
> + if (!nattnums)
> + {
> + ObjectAddressSet(parentobject, RelationRelationId, relid);
> + recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> + }

Can this be unconditional ?

> + * Translate the array of indexs to regular attnums for the dependency (we

sp: indexes

> + * Not found a matching expression, so we can simply skip

Found no matching expr

> + /* if found a matching, */

matching ..

> +examine_attribute(Node *expr)

Maybe you should rename this to something distinct ? So it's easy to add a
breakpoint there, for example.

> + stats->anl_context = CurrentMemoryContext; /* XXX should be using
> + * something else? */

> + bool nulls[Natts_pg_statistic];
...
> + * Construct a new pg_statistic tuple
> + */
> + for (i = 0; i < Natts_pg_statistic; ++i)
> + {
> + nulls[i] = false;
> + }

Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));

> + * We don't store collations used to build the statistics, but
> + * we can use the collation for the attribute itself, as
> + * stored in varcollid. We do reset the statistics after a
> + * type change (including collation change), so this is OK. We
> + * may need to relax this after allowing extended statistics
> + * on expressions.

This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
> }
>
> /* print any extended statistics */
> - if (pset.sversion >= 100000)
> + if (pset.sversion >= 140000)
> + {
> + printfPQExpBuffer(&buf,
> + "SELECT oid, "
> + "stxrelid::pg_catalog.regclass, "
> + "stxnamespace::pg_catalog.regnamespace AS nsp, "
> + "stxname,\n"
> + "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> + " 'd' = any(stxkind) AS ndist_enabled,\n"
> + " 'f' = any(stxkind) AS deps_enabled,\n"
> + " 'm' = any(stxkind) AS mcv_enabled,\n");
> +
> + if (pset.sversion >= 130000)
> + appendPQExpBufferStr(&buf, " stxstattarget\n");
> + else
> + appendPQExpBufferStr(&buf, " -1 AS stxstattarget\n");

>= 130000 is fully determined by >= 14000 :)

> + * type of the opclass, which is not interesting for our purposes. (Note:
> + * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are
still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
>
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates. Create it in the per-index context to be
> Fetch function for analyzing index expressions.

In response to

Re: PoC/WIP: Extended statistics on expressions at 2021-03-24 00:51:09 from Tomas Vondra

Responses

Re: PoC/WIP: Extended statistics on expressions at 2021-03-24 16:38:32 from Tomas Vondra
Re: PoC/WIP: Extended statistics on expressions at 2021-03-24 16:42:26 from Justin Pryzby

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Drouvot, Bertrand	2021-03-24 06:56:51	Re: Minimal logical decoding on standbys
Previous Message	Mark Dilger	2021-03-24 06:13:07	Re: pg_amcheck contrib application