From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: PoC/WIP: Extended statistics on expressions |
Date: | 2021-03-24 06:24:47 |
Message-ID: | 20210324062446.GA28335@telsasoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Most importantly, it looks like this forgets to update catalog documentation
for stxexprs and stxkind='e'
It seems like you're preferring to use pluralized "statistics" in a lot of
places that sound wrong to me. For example:
> Currently the first statistics wins, which seems silly.
I can write more separately, but I think this is resolved and clarified if you
write "statistics object" and not just "statistics".
> + Name of schema containing table
I don't know about the nearby descriptions, but this one sounds too much like a
"schema-containing" table. Say "Name of the schema which contains the table" ?
> + Name of table
Say "name of table on which the extended statistics are defined"
> + Name of extended statistics
"Name of the extended statistic object"
> + Owner of the extended statistics
..object
> + Expression the extended statistics is defined on
I think it should say "the extended statistic", or "the extended statistics
object". Maybe "..on which the extended statistic is defined"
> + of random access to the disk. (This expression is null if the expression
> + data type does not have a <literal><</literal> operator.)
expression's data type
> + much-too-small row count estimate in the first two queries. Moreover, the
maybe say "dramatically underestimates the rowcount"
> + planner has no information about relationship between the expressions, so it
the relationship
> + assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
> + conditions are independent, and multiplies their selectivities together to
> + arrive at a much-too-high group count estimate in the aggregate query.
severe overestimate ?
> + This is further exacerbated by the lack of accurate statistics for the
> + expressions, forcing the planner to use default ndistinct estimate for the
use *a default
> + expression derived from ndistinct for the column. With such statistics, the
> + planner recognizes that the conditions are correlated and arrives at much
> + more accurate estimates.
are correlated comma
> + if (type->lt_opr == InvalidOid)
These could be !OidIsValid
> + * expressions. It's either expensive or very easy to defeat for
> + * determined used, and there's no risk if we allow such statistics (the
> + * statistics is useless, but harmless).
I think it's meant to say "for a determined user" ?
> + * If there are no simply-referenced columns, give the statistics an auto
> + * dependency on the whole table. In most cases, this will be redundant,
> + * but it might not be if the statistics expressions contain no Vars
> + * (which might seem strange but possible).
> + */
> + if (!nattnums)
> + {
> + ObjectAddressSet(parentobject, RelationRelationId, relid);
> + recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
> + }
Can this be unconditional ?
> + * Translate the array of indexs to regular attnums for the dependency (we
sp: indexes
> + * Not found a matching expression, so we can simply skip
Found no matching expr
> + /* if found a matching, */
matching ..
> +examine_attribute(Node *expr)
Maybe you should rename this to something distinct ? So it's easy to add a
breakpoint there, for example.
> + stats->anl_context = CurrentMemoryContext; /* XXX should be using
> + * something else? */
> + bool nulls[Natts_pg_statistic];
...
> + * Construct a new pg_statistic tuple
> + */
> + for (i = 0; i < Natts_pg_statistic; ++i)
> + {
> + nulls[i] = false;
> + }
Shouldn't you just write nulls[Natts_pg_statistic] = {false};
or at least: memset(nulls, 0, sizeof(nulls));
> + * We don't store collations used to build the statistics, but
> + * we can use the collation for the attribute itself, as
> + * stored in varcollid. We do reset the statistics after a
> + * type change (including collation change), so this is OK. We
> + * may need to relax this after allowing extended statistics
> + * on expressions.
This text should be updated or removed ?
> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
> }
>
> /* print any extended statistics */
> - if (pset.sversion >= 100000)
> + if (pset.sversion >= 140000)
> + {
> + printfPQExpBuffer(&buf,
> + "SELECT oid, "
> + "stxrelid::pg_catalog.regclass, "
> + "stxnamespace::pg_catalog.regnamespace AS nsp, "
> + "stxname,\n"
> + "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> + " 'd' = any(stxkind) AS ndist_enabled,\n"
> + " 'f' = any(stxkind) AS deps_enabled,\n"
> + " 'm' = any(stxkind) AS mcv_enabled,\n");
> +
> + if (pset.sversion >= 130000)
> + appendPQExpBufferStr(&buf, " stxstattarget\n");
> + else
> + appendPQExpBufferStr(&buf, " -1 AS stxstattarget\n");
>= 130000 is fully determined by >= 14000 :)
> + * type of the opclass, which is not interesting for our purposes. (Note:
> + * if we did anything with non-expression index columns, we'd need to
index is wrong ?
I mentioned a bunch of other references to "index" and "predicate" which are
still around:
On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
>
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates. Create it in the per-index context to be
> Fetch function for analyzing index expressions.
From | Date | Subject | |
---|---|---|---|
Next Message | Drouvot, Bertrand | 2021-03-24 06:56:51 | Re: Minimal logical decoding on standbys |
Previous Message | Mark Dilger | 2021-03-24 06:13:07 | Re: pg_amcheck contrib application |