Re: Showing applied extended statistics in explain Part 2

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com, yamatattsu(at)gmail(dot)com
Cc: tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2024-07-15 14:40:19
Message-ID: 5b113688-adc2-4dd8-bc3e-2fcf3e0bedae@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Let me share my opinion on those questions ...

On 7/12/24 12:09, Masahiro(dot)Ikeda(at)nttdata(dot)com wrote:
> Hi,
>
> Thanks for working the feature. As a user, I find it useful, and I'd like to use
> it in v18! Although I've just started start looking into it, I have a few questions.
>
>
> (1)
>
> Is it better to make the order of output consistent? For example, even
> though there are three clauses shown in the below case, the order does not
> match.
> * "Filter" shows that "id1" is first.
> * "Ext Stats" shows that "id2" is first.
>
> -- An example
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
> INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i));
> create statistics test_s1 on id1, id2 from test; analyze;
>
> =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Gather (cost=1000.00..23092.77 rows=84311 width=20)
> Workers Planned: 2
> -> Parallel Seq Scan on test (cost=0.00..13661.67 rows=35130 width=20)
> Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10))) -- here
> Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here
> (5 rows)
>

I don't think we need to make the order consistent. It probably wouldn't
hurt, but I'm not sure it's even possible for all scan types - for
example in an index scan, the clauses might be split between index
conditions and filters, etc.

>
>
> (2)
>
> Do we really need the schema names without VERBOSE option? As in the above case,
> "Ext Stats" shows schema name "public", even though the table name "test" isn't
> shown with its schema name.
>
> Additionally, if the VERBOSE option is specified, should the column names also be
> printed with namespace?
>
> =# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Gather (cost=1000.00..22947.37 rows=82857 width=20)
> Output: id1, id2, id3, value
> Workers Planned: 2
> -> Parallel Seq Scan on public.test (cost=0.00..13661.67 rows=34524 width=20)
> Output: id1, id2, id3, value
> Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
> Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here
> (7 rows)
>

Yeah, I don't think there's a good reason to force printing schema for
the statistics, if it's not needed for the table. The rules should be
the same, I think.

>
>
> (3)
>
> I might be misunderstanding something, but do we need the clauses? Is there any
> case where users would want to know the clauses? For example, wouldn't the
> following be sufficient?
>
>> Ext Stats: id1, id2 using test_s1
>

The stats may overlap, and some clauses may be matching multiple of
them. And some statistics do not support all clause types (e.g.
functional dependencies work only with equality conditions). Yes, you
might deduce which statistics are used for which clause, but it's not
trivial - interpreting explain is already not trivial, let's not make it
harder.

(If tracking the exact clauses turns out to be expensive, we might
revisit this - it might make it cheaper).

>
>
> (4)
>
> The extended statistics with "dependencies" or "ndistinct" option don't seem to
> be shown in EXPLAIN output. Am I missing something? (Is this expected?)
>
> I tested the examples in the documentation. Although it might work with
> "mcv" option, I can't confirm that it works because "unrecognized node type"
> error occurred in my environment.
> https://www.postgresql.org/docs/current/sql-createstatistics.html
>
> (It might be wrong since I'm beginner with extended stats codes.)
> IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(),
> and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct().
>
>
> -- doesn't work with "dependencies" option?
> =# \dX
> List of extended statistics
> Schema | Name | Definition | Ndistinct | Dependencies | MCV
> --------+---------+--------------------+-----------+--------------+---------
> public | s1 | a, b FROM t1 | (null) | defined | (null)
> (2 rows)
>
> =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
> Gather (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3)
> Filter: ((a = 1) AND (b = 0))
> Rows Removed by Filter: 333300
> Planning Time: 0.246 ms
> Execution Time: 50.361 ms
> (8 rows)
>
> -- doesn't work with "ndistinct"?
> =# \dX
> List of extended statistics
> Schema | Name | Definition | Ndistinct | Dependencies | MCV
> --------+------+------------------------------------------------------------------+-----------+--------------+--------
> public | s3 | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined | (null) | (null)
> (1 row)
>
> postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
> Seq Scan on t3 (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1)
> Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone)
> Rows Removed by Filter: 480961
> Planning Time: 0.088 ms
> Execution Time: 144.590 ms
> (5 rows)
>
> -- doesn't work with "mvc". It might work, but the error happens in my environments
> =# \dX
> List of extended statistics
> Schema | Name | Definition | Ndistinct | Dependencies | MCV
> --------+------+--------------+-----------+--------------+---------
> public | s2 | a, b FROM t2 | (null) | (null) | defined
> (1 row)
>
> -- I encountered the error with the query.
> =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
> ERROR: unrecognized node type: 268
>
>

Yes, you're right we don't show some stats. For dependencies there's the
problem that we don't apply them individually, so it's not really
possible to map clauses to individual stats. I wonder if we might have a
special "entry" to show clauses estimated by the functional dependencies
combined from all stats (instead of a particular statistics).

For ndistinct, I think we don't show this because it doesn't go through
clauselist_selectivity, which is the only thing I modified in the PoC.
But I guess we might improve estimate_num_groups() to track the stats in
a similar way, I guess.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2024-07-15 14:46:06 Re: add function argument names to regex* functions.
Previous Message David E. Wheeler 2024-07-15 14:29:32 Re: Patch bug: Fix jsonpath .* on Arrays