RE: Showing applied extended statistics in explain Part 2

From: <Masahiro(dot)Ikeda(at)nttdata(dot)com>
To: <yamatattsu(at)gmail(dot)com>, <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: <tatsuro(dot)yamada(at)ntt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <Masao(dot)Fujii(at)nttdata(dot)com>
Subject: RE: Showing applied extended statistics in explain Part 2
Date: 2024-07-12 10:09:42
Message-ID: TYWPR01MB109825D40CB18BED80B6B4140B1A62@TYWPR01MB10982.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for working the feature. As a user, I find it useful, and I'd like to use
it in v18! Although I've just started start looking into it, I have a few questions.

(1)

Is it better to make the order of output consistent? For example, even
though there are three clauses shown in the below case, the order does not
match.
* "Filter" shows that "id1" is first.
* "Ext Stats" shows that "id2" is first.

-- An example
DROP TABLE IF EXISTS test;
CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i));
create statistics test_s1 on id1, id2 from test; analyze;

=# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
QUERY PLAN
---------------------------------------------------------------------------------------
Gather (cost=1000.00..23092.77 rows=84311 width=20)
Workers Planned: 2
-> Parallel Seq Scan on test (cost=0.00..13661.67 rows=35130 width=20)
Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10))) -- here
Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here
(5 rows)

(2)

Do we really need the schema names without VERBOSE option? As in the above case,
"Ext Stats" shows schema name "public", even though the table name "test" isn't
shown with its schema name.

Additionally, if the VERBOSE option is specified, should the column names also be
printed with namespace?

=# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
QUERY PLAN
---------------------------------------------------------------------------------------
Gather (cost=1000.00..22947.37 rows=82857 width=20)
Output: id1, id2, id3, value
Workers Planned: 2
-> Parallel Seq Scan on public.test (cost=0.00..13661.67 rows=34524 width=20)
Output: id1, id2, id3, value
Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here
(7 rows)

(3)

I might be misunderstanding something, but do we need the clauses? Is there any
case where users would want to know the clauses? For example, wouldn't the
following be sufficient?

> Ext Stats: id1, id2 using test_s1

(4)

The extended statistics with "dependencies" or "ndistinct" option don't seem to
be shown in EXPLAIN output. Am I missing something? (Is this expected?)

I tested the examples in the documentation. Although it might work with
"mcv" option, I can't confirm that it works because "unrecognized node type"
error occurred in my environment.
https://www.postgresql.org/docs/current/sql-createstatistics.html

(It might be wrong since I'm beginner with extended stats codes.)
IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(),
and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct().

-- doesn't work with "dependencies" option?
=# \dX
List of extended statistics
Schema | Name | Definition | Ndistinct | Dependencies | MCV
--------+---------+--------------------+-----------+--------------+---------
public | s1 | a, b FROM t1 | (null) | defined | (null)
(2 rows)

=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3)
Filter: ((a = 1) AND (b = 0))
Rows Removed by Filter: 333300
Planning Time: 0.246 ms
Execution Time: 50.361 ms
(8 rows)

-- doesn't work with "ndistinct"?
=# \dX
List of extended statistics
Schema | Name | Definition | Ndistinct | Dependencies | MCV
--------+------+------------------------------------------------------------------+-----------+--------------+--------
public | s3 | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined | (null) | (null)
(1 row)

postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on t3 (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1)
Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 480961
Planning Time: 0.088 ms
Execution Time: 144.590 ms
(5 rows)

-- doesn't work with "mvc". It might work, but the error happens in my environments
=# \dX
List of extended statistics
Schema | Name | Definition | Ndistinct | Dependencies | MCV
--------+------+--------------+-----------+--------------+---------
public | s2 | a, b FROM t2 | (null) | (null) | defined
(1 row)

-- I encountered the error with the query.
=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
ERROR: unrecognized node type: 268

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Junwang Zhao 2024-07-12 10:15:37 Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?
Previous Message Antonin Houska 2024-07-12 10:01:11 Re: Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?