From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PoC/WIP: Extended statistics on expressions |
Date: | 2021-01-19 01:57:05 |
Message-ID: | 3f20da4d-89dc-0912-98d2-b6f0ef06c87f@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 1/18/21 4:48 PM, Dean Rasheed wrote:
> Looking through extended_stats.c, I found a corner case that can lead
> to a seg-fault:
>
> CREATE TABLE foo();
> CREATE STATISTICS s ON (1) FROM foo;
> ANALYSE foo;
>
> This crashes in lookup_var_attr_stats(), because it isn't expecting
> nvacatts to be 0. I can't think of any case where building stats on a
> table with no analysable columns is useful, so it should probably just
> exit early in that case.
>
>
> In BuildRelationExtStatistics(), it looks like min_attrs should be
> declared assert-only.
>
>
> In evaluate_expressions():
>
> + /* set the pointers */
> + result = (ExprInfo *) ptr;
> + ptr += sizeof(ExprInfo);
>
> I think that should probably have a MAXALIGN().
>
Thanks, I'll fix all of that.
>
> A slightly bigger issue that I don't like is the way it assigns
> attribute numbers for expressions starting from
> MaxHeapAttributeNumber+1, so the first expression has an attnum of
> 1601. That leads to pretty inefficient use of Bitmapsets, since most
> tables only contain a handful of columns, and there's a large block of
> unused space in the middle the Bitmapset.
>
> An alternative approach might be to use regular attnums for columns
> and use negative indexes -1, -2, -3, ... for expressions in the stored
> stats. Then when storing and retrieving attnums from Bitmapsets, it
> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> in the Bitmapsets, since there can't be more than that many
> expressions (just like other code stores system attributes using
> FirstLowInvalidHeapAttributeNumber).
>
> That would be a somewhat bigger change, but hopefully fairly
> mechanical, and then some code like add_expressions_to_attributes()
> would go away.
>
Well, I tried this but unfortunately it's not that simple. We still need
to build the bitmaps, so I don't think add_expression_to_attributes can
be just removed. I mean, we need to do the offsetting somewhere, even if
we change how we do it.
But the main issue is that in some cases the number of expressions is
not really limited by STATS_MAX_DIMENSIONS - for example when applying
functional dependencies, we "merge" multiple statistics, so we may end
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.
Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
the order of processing (so far we've assumed expressions are after
regular attnums). So the changes are more extensive - I tried doing that
anyway, and I'm still struggling with crashes and regression failures.
Of course, that doesn't mean we shouldn't do it, but it's far from
mechanical. (Some of that is probably a sign this code needs a bit more
work to polish.)
But I wonder if it'd be easier to just calculate the actual max attnum
and then use it instead of MaxHeapAttributeNumber ...
>
> Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
> show expressions until the statistics have been built. For example:
>
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
>
> statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
> s | foo | |
> (1 row)
>
> but after populating and analysing the table, this becomes:
>
> statistics_name | tablename | expr | n_distinct
> -----------------+-----------+---------+------------
> s | foo | (a + b) | 11
> s | foo | (a * b) | 11
> (2 rows)
>
> I think it should show the expressions even before the stats have been built.
>
> Another issue is that it returns rows for non-expression stats as
> well. For example:
>
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON a, b FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
>
> statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
> s | foo | |
> (1 row)
>
> and those values will never be populated, since they're not
> expressions, so I would expect them to not be shown in the view.
>
> So basically, instead of
>
> + LEFT JOIN LATERAL (
> + SELECT
> + *
> + FROM (
> + SELECT
> +
> unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> + unnest(sd.stxdexpr)::pg_statistic AS a
> + ) x
> + ) stat ON sd.stxdexpr IS NOT NULL;
>
> perhaps just
>
> + JOIN LATERAL (
> + SELECT
> + *
> + FROM (
> + SELECT
> +
> unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> + unnest(sd.stxdexpr)::pg_statistic AS a
> + ) x
> + ) stat ON true;
Will fix.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Ian Lawrence Barwick | 2021-01-19 02:03:34 | Re: psql \df choose functions by their arguments |
Previous Message | tsunakawa.takay@fujitsu.com | 2021-01-19 01:56:50 | RE: POC: postgres_fdw insert batching |