RE: Showing applied extended statistics in explain Part 2

From: <Masahiro(dot)Ikeda(at)nttdata(dot)com>
To: <tomas(dot)vondra(at)enterprisedb(dot)com>, <yamatattsu(at)gmail(dot)com>
Cc: <tatsuro(dot)yamada(at)ntt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <Masao(dot)Fujii(at)nttdata(dot)com>
Subject: RE: Showing applied extended statistics in explain Part 2
Date: 2024-07-18 10:37:42
Message-ID: TYWPR01MB109820AEAB1E18F89404A8304B1AC2@TYWPR01MB10982.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Let me share my opinion on those questions ...

Thanks! I could understand the patch well thanks to your comments.


> On 7/12/24 12:09, Masahiro(dot)Ikeda(at)nttdata(dot)com wrote:
> > Is it better to make the order of output consistent? For example, even
> > though there are three clauses shown in the below case, the order does
> > not match.
> > * "Filter" shows that "id1" is first.
> > * "Ext Stats" shows that "id2" is first.
> >
> > -- An example
> > DROP TABLE IF EXISTS test;
> > CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
> > INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM
> > generate_series(1,1000000) s(i)); create statistics test_s1 on id1,
> > id2 from test; analyze;
> >
> > =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
> > QUERY PLAN
> > ----------------------------------------------------------------------
> > ----------------- Gather (cost=1000.00..23092.77 rows=84311
> > width=20)
> > Workers Planned: 2
> > -> Parallel Seq Scan on test (cost=0.00..13661.67 rows=35130 width=20)
> > Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))
> -- here
> > Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))
> -- here
> > (5 rows)
> >
>
> I don't think we need to make the order consistent. It probably wouldn't hurt, but I'm
> not sure it's even possible for all scan types - for example in an index scan, the clauses
> might be split between index conditions and filters, etc.

OK, I understand it isn't unexpected behavior.

> > (3)
> >
> > I might be misunderstanding something, but do we need the clauses? Is
> > there any case where users would want to know the clauses? For
> > example, wouldn't the following be sufficient?
> >
> >> Ext Stats: id1, id2 using test_s1
> >
>
> The stats may overlap, and some clauses may be matching multiple of them. And some
> statistics do not support all clause types (e.g.
> functional dependencies work only with equality conditions). Yes, you might deduce
> which statistics are used for which clause, but it's not trivial - interpreting explain is
> already not trivial, let's not make it harder.
>
> (If tracking the exact clauses turns out to be expensive, we might revisit this - it might
> make it cheaper).

Thanks. I agree that we need to show the clauses.

> > (4)
> >
> > The extended statistics with "dependencies" or "ndistinct" option
> > don't seem to be shown in EXPLAIN output. Am I missing something? (Is
> > this expected?)
> >
> > I tested the examples in the documentation. Although it might work
> > with "mcv" option, I can't confirm that it works because "unrecognized node type"
> > error occurred in my environment.
> > https://urldefense.com/v3/__https://www.postgresql.org/docs/current/sq
> > l-createstatistics.html__;!!GCTRfqYYOYGmgK_z!9H-FTXrhg7cr0U2r4PoKEeWM1
> > v9feP8I8zlNyhf-801n-KI8bIMAxOQgaetSTpek3ECk2_FKWEsuApVZ-ys-ka7rfjX8ANB
> > 9zQ$
> >
> > (It might be wrong since I'm beginner with extended stats codes.)
> > IIUC, the reason is that the patch only handles
> > statext_mcv_clauselist_selectivity(),
> > and doesn't handle dependencies_clauselist_selectivity() and
> estimate_multivariate_ndistinct().
> >
> >
> > -- doesn't work with "dependencies" option?
> > =# \dX
> > List of extended statistics
> > Schema | Name | Definition | Ndistinct | Dependencies | MCV
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> > public | s1 | a, b FROM t1 | (null) | defined | (null)
> > (2 rows)
> >
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
> > QUERY PLAN
> > ----------------------------------------------------------------------
> > ---------------------------------------------
> > Gather (cost=1000.00..11685.00 rows=100 width=8) (actual
> time=0.214..50.327 rows=100 loops=1)
> > Workers Planned: 2
> > Workers Launched: 2
> > -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=42 width=8) (actual
> time=30.300..46.610 rows=33 loops=3)
> > Filter: ((a = 1) AND (b = 0))
> > Rows Removed by Filter: 333300 Planning Time: 0.246 ms
> > Execution Time: 50.361 ms
> > (8 rows)
> >
> > -- doesn't work with "ndistinct"?
> > =# \dX
> > List of extended statistics
> > Schema | Name | Definition |
> Ndistinct | Dependencies | MCV
> >
> --------+------+------------------------------------------------------------------+------
> -----+--------------+--------
> > public | s3 | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 |
> defined | (null) | (null)
> > (1 row)
> >
> > postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> > WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> > QUERY PLAN
> > ----------------------------------------------------------------------
> > ------------------------------------
> > Seq Scan on t3 (cost=0.00..10210.01 rows=45710 width=8) (actual
> time=0.027..143.199 rows=44640 loops=1)
> > Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp
> without time zone)
> > Rows Removed by Filter: 480961
> > Planning Time: 0.088 ms
> > Execution Time: 144.590 ms
> > (5 rows)
> >
> > -- doesn't work with "mvc". It might work, but the error happens in my
> > environments =# \dX
> > List of extended statistics
> > Schema | Name | Definition | Ndistinct | Dependencies | MCV
> > --------+------+--------------+-----------+--------------+---------
> > public | s2 | a, b FROM t2 | (null) | (null) | defined
> > (1 row)
> >
> > -- I encountered the error with the query.
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b =
> > 1);
> > ERROR: unrecognized node type: 268
> >
> >
>
> Yes, you're right we don't show some stats. For dependencies there's the problem that
> we don't apply them individually, so it's not really possible to map clauses to individual
> stats. I wonder if we might have a special "entry" to show clauses estimated by the
> functional dependencies combined from all stats (instead of a particular statistics).

OK, I understand it's intended behavior for "dependencies" and we need to consider how to
show them in EXPLAIN output in future.

> For ndistinct, I think we don't show this because it doesn't go through
> clauselist_selectivity, which is the only thing I modified in the PoC.
> But I guess we might improve estimate_num_groups() to track the stats in a similar way,
> I guess.

Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because
the number of clauses is one though it goes through clauselist_selectivity().

> > ERROR: unrecognized node type: 268

Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch?
change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to
show extra information for example "kind" of "StatisticExtInfo"?

The above is just one idea came up with while I read the following comments of header
of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy.
* We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
* There are some subsidiary structs that are useful to copy, though.

By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice.
I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and "Group".
I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users understand
when the stats are used, but I don't have any idea now.

-- I tested with the example of CREATE STATISTICS documentation.
psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1)
Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
-> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2)
Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
Sort Method: quicksort Memory: 32kB
Worker 0: Sort Method: quicksort Memory: 32kB
-> Partial HashAggregate (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768 rows=249 loops=2)
Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
Batches: 1 Memory Usage: 45kB
Worker 0: Batches: 1 Memory Usage: 45kB
-> Parallel Seq Scan on t3 (cost=0.00..6963.66 rows=309177 width=16) (actual time=0.021..171.214 rows=262800 loops=2)
Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here
Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here
Planning Time: 114327.206 ms
Execution Time: 288.007 ms
(18 rows)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
change_to_applied_stats_has_list_of_oids.diff application/octet-stream 2.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2024-07-18 10:49:04 RE: [Proposal] Add foreign-server health checks infrastructure
Previous Message Ashutosh Bapat 2024-07-18 10:28:30 Re: Add mention of execution time memory for enable_partitionwise_* GUCs