Re: Showing applied extended statistics in explain Part 2

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Tatsuro Yamada <yamatattsu(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, Masahiro(dot)Ikeda(at)nttdata(dot)com, tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com, ilya(dot)evdokimov(at)tantorlabs(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2025-02-12 11:11:36
Message-ID: 7b2aac96-4b56-48b8-a07b-0ee877f3732c@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/12/25 06:02, Tatsuro Yamada wrote:
> Hi Tomas and ALL,
>
>>I wonder what Yamada-san thinks about these suggestions ... He's the one
>>actually developing the patch, so I'd like to know his opinions.
>
> I will state my thoughts on the two points of discussion.
>
>>>>> I often use SQL Server to compare execution plans generated by
>>>>> PostgreSQL, and I appreciate how they display the usage of extended
>>>>> statistics. They clearly identify which statistics were utilized during
>>>>> query planning and include them in the summary section of query plan.
>>>>>
>>>> I admit not knowing what exactly SQL Server shows in the explain, but it
>>>> seems helpful to know which part of the plan used which statistics, no?
>>>
>>> I wonder if you meant clauses, not a part of the plan. Clauses may be
>>> used on different levels of the join tree, but they are estimated once.
>>>
>>> I meant we may just refer to the statistic used by its name in the
>>> node's explanation if this node contains estimated clause. I use
>>> multiple extended statistics. Sometimes, they intersect, and it is hard
>>> to say which one was used for a specific clause.
>>
>>I did mean "part of the plan", i.e. the actual operation. That being
>>said, maybe you're right the exact node doesn't matter all that much,
>>and it'd be better to have a separate list of stats for the whole plan.
>>
>>That'd probably work better for stuff like estimate_num_groups(), which
>>is hard to assign to a particular operation ...
>
>
> The patch currently under development aims to display the actual usage of
> extended statistics on a plan-by-plan basis.
> I thought that being able to check the extended statistics used by plan
> nodes
> in a plan would make it easier to tune plans using extended statistics.
>
> On the other hand, I also think that a mechanism to check the usage of
> extended statistics on a DB-by-DB basis would be useful.
> I talked about this idea in my talk at PGConf.dev 2024.
> This is similar to pg_stat_user_indexes for indexes. If you can find
> unused or
> infrequently used extended statistics, you can delete unnecessary ones.
> However, this functionality is not included in this patch. I plan to
> provide it as 
> a separate patch after the current patch is committed.
>

+1 to treating the DB-level stuff (i.e. system catalog showing how often
was each statistics used, etc.) as a separate feature / patch. I did not
mean to suggest this should be included in this patch, my point was that
it's very different from what this patch aimed to do.

>
>>>> I may be wrong, but I don't quite see why would that be significantly
>>>> easier to implement. You still need to track the information somewhere,
>>>> and it probably needs for individual Path nodes. Because while building
>>>> the plan you don't know which paths will get selected.
>>> In my mind, gathering extended statistics usage data may be implemented
>>> like it is already done for many parameters in the stat collector. It
>>> may help identify whether statistics are used frequently or not.
>>
>>Isn't the statistics collector dealing with a completely different use
>>case? Sure, maybe it would be interesting to track how many queries used
>>which statistics object / for what purpose. A bit like we track counts
>>for indexes etc.
>>
>>But that seems very different from what this patch aims to do, which is
>>to track info about how a particular query used extended stats.
>>
>>I suppose we could have a "buffer" to remember which statistics objects
>>were used by a single statement/transaction, just like we do for tables
>>or indexes. But I still think we should show clauses estimated by each
>>statistics object in a given query, and this would not help with that.
>
> Ideally, it would be nice to be able to show which extended statistics
> were used for which clauses. However, based on the discussion so far,
> this seems difficult to implement.
>

Maybe I'm missing something, but why would this be difficult to
implement? There was a problem with RestrictInfo nodes, but doesn't
Tom's suggestion [1] to use extract_actual_clauses() address this? I
might have done something silly in the PoC patch.

https://www.postgresql.org/message-id/1691487.1733180716@sss.pgh.pa.us

> A second best solution would be to add a summary of the extended statistics
> used at the end of the plan. This alone would make it easier for users to
> use the extended statistics.
>
> The expected output is as follows:
>     - Do not output "Clauses: ".
>     - Display only the names of the extended statistics used.
>     - If displaying the names of multiple extended statistics,
>       should they be joined by commas? (Or should Ext Stats: be displayed
>       on multiple lines?)
>

The list of stats is better than no information at all, but I don't
quite see why we couldn't show clauses, and to me the clauses seem like
a pretty important piece of information.

> e.g.,
> explain (stats) select 1 from t where a = 1 and b =1;
>
> ## Output from the current patch
>                        QUERY PLAN
> --------------------------------------------------------
>  Seq Scan on t  (cost=0.00..1943.00 rows=10210 width=4)
>    Filter: ((a = 1) AND (b = 1))
>    Ext Stats: s  Clauses: ((a = 1) AND (b = 1))
> (3 rows)
>
>
> ## Output for displaying summary extended statistics
>                        QUERY PLAN
> --------------------------------------------------------
>  Seq Scan on t  (cost=0.00..1943.00 rows=10210 width=4)
>    Filter: ((a = 1) AND (b = 1))
>  Planning:
>    Ext Stats: s  <== Is it acceptable?
>
> If we do not display "Clause" information, I believe that there is
> no need to deparse clauses, and therefore no need to improve the deparse
> function to enable recursive traversal of expression trees.
>

What do you mean by the need to improve the deparse function? Is this
about the nested RestrictInfos? Have you tried doing that Tom suggested
in [1], or does it not address the issue?

> To further the discussion, I will modify the patch to display the summary
> as shown above and send it to -hackers.
>
> P.S.
> I have tried to understand the discussion correctly, but please
> let me know if I have misunderstood anything.
>

Thanks for working on this!

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Álvaro Herrera 2025-02-12 11:13:52 Re: NOT ENFORCED constraint feature
Previous Message Ashutosh Bapat 2025-02-12 10:56:50 Re: TAP test command_fails versus command_fails_like