Quick Links

Re: Showing applied extended statistics in explain Part 2

From:	Tatsuro Yamada <yamatattsu(at)gmail(dot)com>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Andrei Lepikhov <lepihov(at)gmail(dot)com>, Masahiro(dot)Ikeda(at)nttdata(dot)com, tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com, ilya(dot)evdokimov(at)tantorlabs(dot)com
Subject:	Re: Showing applied extended statistics in explain Part 2
Date:	2025-02-12 05:02:20
Message-ID:	CAOKkKFv0VwVZDsxzA_GAjwPv2PJsqfQPQvf=K7BcxvLznrmkrA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Tomas and ALL,

>I wonder what Yamada-san thinks about these suggestions ... He's the one
>actually developing the patch, so I'd like to know his opinions.

I will state my thoughts on the two points of discussion.

>>>> I often use SQL Server to compare execution plans generated by
>>>> PostgreSQL, and I appreciate how they display the usage of extended
>>>> statistics. They clearly identify which statistics were utilized during
>>>> query planning and include them in the summary section of query plan.
>>>>
>>> I admit not knowing what exactly SQL Server shows in the explain, but it
>>> seems helpful to know which part of the plan used which statistics, no?
>>
>> I wonder if you meant clauses, not a part of the plan. Clauses may be
>> used on different levels of the join tree, but they are estimated once.
>>
>> I meant we may just refer to the statistic used by its name in the
>> node's explanation if this node contains estimated clause. I use
>> multiple extended statistics. Sometimes, they intersect, and it is hard
>> to say which one was used for a specific clause.
>
>I did mean "part of the plan", i.e. the actual operation. That being
>said, maybe you're right the exact node doesn't matter all that much,
>and it'd be better to have a separate list of stats for the whole plan.
>
>That'd probably work better for stuff like estimate_num_groups(), which
>is hard to assign to a particular operation ...

The patch currently under development aims to display the actual usage of
extended statistics on a plan-by-plan basis.
I thought that being able to check the extended statistics used by plan
nodes
in a plan would make it easier to tune plans using extended statistics.

On the other hand, I also think that a mechanism to check the usage of
extended statistics on a DB-by-DB basis would be useful.
I talked about this idea in my talk at PGConf.dev 2024.
This is similar to pg_stat_user_indexes for indexes. If you can find unused
or
infrequently used extended statistics, you can delete unnecessary ones.
However, this functionality is not included in this patch. I plan to
provide it as
a separate patch after the current patch is committed.

>>> I may be wrong, but I don't quite see why would that be significantly
>>> easier to implement. You still need to track the information somewhere,
>>> and it probably needs for individual Path nodes. Because while building
>>> the plan you don't know which paths will get selected.
>> In my mind, gathering extended statistics usage data may be implemented
>> like it is already done for many parameters in the stat collector. It
>> may help identify whether statistics are used frequently or not.
>
>Isn't the statistics collector dealing with a completely different use
>case? Sure, maybe it would be interesting to track how many queries used
>which statistics object / for what purpose. A bit like we track counts
>for indexes etc.
>
>But that seems very different from what this patch aims to do, which is
>to track info about how a particular query used extended stats.
>
>I suppose we could have a "buffer" to remember which statistics objects
>were used by a single statement/transaction, just like we do for tables
>or indexes. But I still think we should show clauses estimated by each
>statistics object in a given query, and this would not help with that.

Ideally, it would be nice to be able to show which extended statistics
were used for which clauses. However, based on the discussion so far,
this seems difficult to implement.

A second best solution would be to add a summary of the extended statistics
used at the end of the plan. This alone would make it easier for users to
use the extended statistics.

The expected output is as follows:
- Do not output "Clauses: ".
- Display only the names of the extended statistics used.
- If displaying the names of multiple extended statistics,
should they be joined by commas? (Or should Ext Stats: be displayed
on multiple lines?)

e.g.,
explain (stats) select 1 from t where a = 1 and b =1;

## Output from the current patch
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1943.00 rows=10210 width=4)
Filter: ((a = 1) AND (b = 1))
Ext Stats: s Clauses: ((a = 1) AND (b = 1))
(3 rows)

## Output for displaying summary extended statistics
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1943.00 rows=10210 width=4)
Filter: ((a = 1) AND (b = 1))
Planning:
Ext Stats: s <== Is it acceptable?

If we do not display "Clause" information, I believe that there is
no need to deparse clauses, and therefore no need to improve the deparse
function to enable recursive traversal of expression trees.

To further the discussion, I will modify the patch to display the summary
as shown above and send it to -hackers.

P.S.
I have tried to understand the discussion correctly, but please
let me know if I have misunderstood anything.

Regards,
Tatsuro Yamada

In response to

Re: Showing applied extended statistics in explain Part 2 at 2025-02-10 15:56:45 from Tomas Vondra

Responses

Re: Showing applied extended statistics in explain Part 2 at 2025-02-12 11:11:36 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Smith	2025-02-12 05:05:29	TAP test command_fails versus command_fails_like
Previous Message	Ashutosh Bapat	2025-02-12 04:47:07	Re: NOT ENFORCED constraint feature