Re: Showing applied extended statistics in explain Part 2

From: Tatsuro Yamada <yamatattsu(at)gmail(dot)com>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: tomas(dot)vondra(at)enterprisedb(dot)com, tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2024-11-01 05:22:52
Message-ID: CAOKkKFvDu_XiBNOhRGfRmr0j-Si_af=TkspmmJ+cj8wG0QaKsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

I apologize for not being able to continue development due to various
circumstances.
The attached file is the rebased patch.
I will now catch up on the discussion and try to revise the patch.

Regards,
Tatsuro Yamada

On Fri, Jul 19, 2024 at 7:17 PM <Masahiro(dot)Ikeda(at)nttdata(dot)com> wrote:

> > On 7/18/24 12:37, Masahiro(dot)Ikeda(at)nttdata(dot)com wrote:
> > >> Let me share my opinion on those questions ...
> > > ...>
> > >> For ndistinct, I think we don't show this because it doesn't go
> > >> through clauselist_selectivity, which is the only thing I modified in
> the PoC.
> > >> But I guess we might improve estimate_num_groups() to track the stats
> > >> in a similar way, I guess.
> > >
> > > Thanks. IIUC, the reason is that it doesn't go through
> > > statext_clauselist_selectivity() because the number of clauses is one
> though it goes
> > through clauselist_selectivity().
> > >
> >
> > Ah, I see I misunderstood the original report. The query used was
> >
> > EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> > WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> >
> > And it has nothing to do with the number of clauses being one neither.
> >
> > The problem is this estimate is handled by examine_variable() matching
> the expression
> > to the "expression" stats, and injecting it into the variable, so that
> the
> > clauselist_selectivity() sees these stats.
> >
> > This would happen even if you build just expression statistics on each
> of the
> > date_trunc() calls, and then tried a query with two clauses:
> >
> > CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
> > CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;
> >
> > EXPLAIN SELECT * FROM t3
> > WHERE date_trunc('month', a) = '2020-01-01'::timestamp
> > AND date_trunc('day', 'a') = '2020-01-01'::timestamp;
> >
> > Not sure how to handle this - we could remember when explain_variable()
> injects
> > statistics like this, I guess. But do we know that each call to
> > examine_variable() is for estimation? And do we know for which clause?
>
> I see. The issue is related to extended statistics for single expression.
> As a
> first step, it's ok for me that we don't support it.
>
> The below is just an idea to know clauses...
> Although I'm missing something, can callers of examine_variable()
> for estimation to rebuild the clauses from partial information of "OpExpr"?
>
> Only clause_selectivity_ext() knows the information of actual full clauses.
> But we don't need full information. It's enough to know the information
> to show "OpExpr" for EXPLAIN.
>
> get_oper_expr() deparse "OpExpr" using only the operator oid and arguments
> in get_oper_expr().
>
> If so, the caller to estimate, for example eqsel_internal(),
> scalarineqsel_wrapper()
> and so on, seems to be able to know the "OpExpr" information, which are
> operator
> oid and arguments, and used extended statistics easily to show for EXPLAIN.
>
> # Memo: the call path of the estimation function
> caller to estimate selectivity
> (eqsel_internal()/scalargtjoinsel_wrappter()/...)
> -> get_restriction_variable()/get_join_valiables()
> -> examine_variable()
>
>
> > >>> ERROR: unrecognized node type: 268
> > >
> > > Regarding the above error, do "applied_stats" need have the list of
> "StatisticExtInfo"
> > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN
> output in the
> > current patch?
> > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed.
> > > Do you have any plan to show extra information for example "kind" of
> > "StatisticExtInfo"?
> > >
> > > The above is just one idea came up with while I read the following
> > > comments of header of pathnodes.h, and to support copy
> "StatisticExtInfo" will leads
> > many other nodes to support copy.
> > > * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
> > > * There are some subsidiary structs that are useful to copy, though.
> > >
> >
> > I do think tracking just the OID would work, because we already know how
> to copy List
> > objects. But if we want to also track the clauses, we'd have to keep
> multiple lists, right?
> > That seems a bit inconvenient.
>
> Understood. In future, we might show not only the applied_clauses but also
> the clauses of
> its extended statistics (StatisticExtInfo->exprs).
>
>
> > > By the way, I found curios result while I tested with the above patch.
> It shows same
> > "Ext Stats" twice.
> > > I think it's expected behavior because the stat is used when estimate
> the cost of
> > "Partial HashAggregate" and "Group".
> > > I've shared the result because I could not understand soon when I saw
> > > it first time. I think it's better to let users understand when the
> stats are used, but I
> > don't have any idea now.
> > >
> > > -- I tested with the example of CREATE STATISTICS documentation.
> > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a),
> date_trunc('day',
> > a) FROM t3 GROUP BY 1, 2;
> > > QUERY
> PLAN
> > > ----------------------------------------------------------------------
> > > ----------------------------------------------------------------------
> > > - Group (cost=9530.56..9576.18 rows=365 width=16) (actual
> > > time=286.908..287.909 rows=366 loops=1)
> > > Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text,
> a))
> > > -> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual
> > time=286.904..287.822 rows=498 loops=1)
> > > Workers Planned: 1
> > > Workers Launched: 1
> > > -> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual
> > time=282.905..282.919 rows=249 loops=2)
> > > Sort Key: (date_trunc('month'::text, a)),
> (date_trunc('day'::text, a))
> > > Sort Method: quicksort Memory: 32kB
> > > Worker 0: Sort Method: quicksort Memory: 32kB
> > > -> Partial HashAggregate (cost=8509.54..8515.02
> rows=365
> > width=16) (actual time=282.716..282.768 rows=249 loops=2)
> > > Group Key: date_trunc('month'::text, a),
> date_trunc('day'::text,
> > a)
> > > Batches: 1 Memory Usage: 45kB
> > > Worker 0: Batches: 1 Memory Usage: 45kB
> > > -> Parallel Seq Scan on t3 (cost=0.00..6963.66
> rows=309177
> > width=16) (actual time=0.021..171.214 rows=262800 loops=2)
> > > Ext Stats: public.s3 Clauses:
> date_trunc('month'::text,
> > a), date_trunc('day'::text, a) -- here
> > > Ext Stats: public.s3 Clauses:
> > > date_trunc('month'::text, a), date_trunc('day'::text, a) -- here
> > > Planning Time: 114327.206 ms Execution Time: 288.007 ms
> > > (18 rows)
> > >
> >
> > I haven't looked into this, but my guess would be this is somehow
> related to the
> > parallelism - there's one parallel worker, which means we have 2
> processes to report
> > stats for (leader + worker). And you get two copies of the "Ext Stats"
> line. Could be a
> > coincidence, ofc, but maybe there's a loop to print some worker info,
> and you print the
> > statistics info in it?
>
> I think yes and no. In the above case, it relates to parallelism, but it
> doesn't print the
> information per each worker.
>
> -- Make the number of workers is 5 and EXPLAIN without ANALYZE option.
> -- But "Ext Stats" is printed only twice.
> =# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a)
> FROM t3 GROUP BY 1, 2;
> QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------
> Group (cost=4449.49..4489.50 rows=365 width=16)
> Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> -> Gather Merge (cost=4449.49..4478.55 rows=1825 width=16)
> Workers Planned: 5
> -> Sort (cost=4449.41..4450.32 rows=365 width=16)
> Sort Key: (date_trunc('month'::text, a)),
> (date_trunc('day'::text, a))
> -> Partial HashAggregate (cost=4428.40..4433.88 rows=365
> width=16)
> Group Key: date_trunc('month'::text, a),
> date_trunc('day'::text, a)
> -> Parallel Seq Scan on t3 (cost=0.00..3902.80
> rows=105120 width=16)
> Ext Stats: public.s3 Clauses:
> date_trunc('month'::text, a), date_trunc('day'::text, a)
> Ext Stats: public.s3 Clauses:
> date_trunc('month'::text, a), date_trunc('day'::text, a)
> (11 rows)
>
> When creating a group path, it creates partial grouping paths if possible,
> and then
> creates the final grouping path. At this time, both the partial grouping
> path and
> the final grouping path use the same RelOptInfo to repeatedly use the
> extended
> statistics to know how many groups there will be. That's why it outputs
> only twice.
> There may be other similar calculation for partial paths.
>
> # The call path of the above query
> create_grouping_paths
> create_ordinary_grouping_paths
> create_partial_grouping_paths
> get_number_of_groups
> estimate_num_groups
> estimate_multivariate_ndistinct -- first time to estimate the
> number of groups for partial grouping path
> get_number_of_groups
> estimate_num_groups
> estimate_multivariate_ndistinct -- second time to estimate the
> number of groups for final grouping path
>
>
> Regards,
> --
> Masahiro Ikeda
> NTT DATA CORPORATION
>

Attachment Content-Type Size
0001-Add-a-new-option-STATS-to-Explain-command_r2.patch application/octet-stream 20.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-11-01 05:39:11 Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Previous Message Andrey M. Borodin 2024-11-01 04:53:36 Re: UUID v7