From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>, Tatsuro Yamada <yamatattsu(at)gmail(dot)com>, Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: tomas(dot)vondra(at)enterprisedb(dot)com, tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2025-02-10 15:56:45
Message-ID: f0cd4b5e-8f7d-4607-a64a-3bf5d2e861e5@vondra.me
Lists: pgsql-hackers
On 2/10/25 10:09, Andrei Lepikhov wrote:
> On 8/2/2025 20:50, Tomas Vondra wrote:
>>
>>
>> On 1/24/25 11:17, Andrei Lepikhov wrote:
>>> On 11/1/24 12:22, Tatsuro Yamada wrote:
>>> I often use SQL Server to compare execution plans generated by
>>> PostgreSQL, and I appreciate how they display the usage of extended
>>> statistics. They clearly identify which statistics were utilized during
>>> query planning and include them in the summary section of the query plan.
>>>
>> I admit not knowing what exactly SQL Server shows in the explain, but it
>> seems helpful to know which part of the plan used which statistics, no?
>
> I wonder if you meant clauses, not a part of the plan. Clauses may be
> used on different levels of the join tree, but they are estimated once.
>
> I meant we may just refer to the statistic used by its name in the
> node's explanation if this node contains an estimated clause. I use
> multiple extended statistics. Sometimes, they intersect, and it is hard
> to say which one was used for a specific clause.
I did mean "part of the plan", i.e. the actual operation. That being
said, maybe you're right that the exact node doesn't matter all that
much, and it'd be better to have a separate list of stats for the whole
plan.
That'd probably work better for stuff like estimate_num_groups(), which
is hard to assign to a particular operation ...
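Consider the ndistinct case, for example (the names are made up, and the
"Extended Statistics" line is just a sketch of possible output, not what
the patch prints today):

    CREATE TABLE t (a int, b int);
    CREATE STATISTICS s (ndistinct) ON a, b FROM t;
    ANALYZE t;

    EXPLAIN SELECT a, b FROM t GROUP BY a, b;

The ndistinct coefficients from "s" feed estimate_num_groups() and thus
the row estimate of the aggregate, but there's no clause on that node to
attach the statistics to. A plan-level list, something like

    Extended Statistics: public.s (ndistinct)

seems like the natural place for cases like that.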
>> Imagine we only knew an index was used, but not which node used it and
>> for what keys. That would be a bit ... useless.
> Sure, but I am suspicious about the strong necessity to show a specific
> clause estimated by the statistic. It is almost obvious because of the
> simple choosing algorithm.
You could say the same thing about indexes. The process of picking an
index is fairly straightforward, as is determining the keys and filters
for the index scan. Yet we still show all of this in the plan. The
algorithm may be simple, but the outcomes are far from obvious, and we
may also enhance/rework the algorithm in the future ... I don't think we
should require intimate knowledge of these details from users.
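To make the analogy concrete - for a plain index scan the plan already
spells out the node, the keys and the remaining filter (a trimmed
example, with made-up names):

    Index Scan using t_a_idx on t
      Index Cond: (a = 1)
      Filter: (b = 2)

Showing which statistics object estimated which clauses seems like a
similar level of detail, just for extended statistics.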
Also, a couple of paragraphs back you wrote:

    Sometimes, they intersect, and it is hard to say which one was used
    for a specific clause.

Doesn't that rather contradict your claim that it's "almost obvious"?
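For illustration (made-up objects and columns), with two overlapping
statistics objects

    CREATE STATISTICS s1 (dependencies) ON a, b FROM t;
    CREATE STATISTICS s2 (dependencies) ON b, c FROM t;

and a query like

    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

I don't think it's obvious from the plan alone which of the two objects
was applied to which of the three clauses.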
>>
>> Or what info does the SQL server include in the plan, exactly? Can you
>> share an example?
> I wouldn't say SQL Server designed it ideally. In an XML file, you can
> find something like this:
>
> <StatisticsInfo LastUpdate="2024-09-09T21:55:04.43"
> ModificationCount="0" SamplingPercent="17.9892"
> Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]"
> Schema="[dbo]" Database="[DB]"></StatisticsInfo>
>
> The main benefit here is that you see all the stats involved in the
> estimations (and their state), even if the final plan doesn't contain
> the estimated stuff at all.
OK, that seems very underwhelming. I still think we should show which
clauses were estimated using which statistics object.
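Something along these lines, say (the format and names are purely
illustrative, not what the patch currently prints):

    Seq Scan on t
      Filter: ((a = 1) AND (b = 2))
      Statistics: public.s1 (mcv)  Clauses: ((a = 1) AND (b = 2))

so the reader can tell both the object and the clauses it estimated.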
>>
>>> I find this method much easier to implement, as it allows us to see any
>>> usage points - remember that `estimate_num_groups` may be applied in
>>> multiple places and may not always correspond to a node clause.
>>>
>>
>> I may be wrong, but I don't quite see why that would be significantly
>> easier to implement. You still need to track the information somewhere,
>> and it probably needs to be tracked for individual Path nodes, because
>> while building the plan you don't know which paths will get selected.
> In my mind, gathering extended statistics usage data may be implemented
> like it is already done for many parameters in the stat collector. It
> may help identify whether statistics are used frequently or not.
Isn't the statistics collector dealing with a completely different use
case? Sure, it might be interesting to track how many queries used which
statistics object, and for what purpose - a bit like we track scan
counts for indexes.
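Roughly the kind of thing we already expose for indexes (the view is
real, the table name is made up):

    SELECT indexrelname, idx_scan
      FROM pg_stat_user_indexes
     WHERE relname = 't';

and one could imagine a similar cumulative view counting how often each
statistics object gets applied.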
But that seems very different from what this patch aims to do, which is
to track info about how a particular query used extended stats.
I suppose we could have a "buffer" to remember which statistics objects
were used by a single statement/transaction, just like we do for tables
or indexes. But I still think we should show clauses estimated by each
statistics object in a given query, and this would not help with that.
>>
>> Although, maybe the selectivity estimation is sufficiently independent
>> from the exact paths? In which case we might track it at the plan level.
> Not only that argument. Usually, when I request a DBMS about an EXPLAIN,
> my typical desire is to realise which plan we have and why the optimiser
> has skipped another, maybe even more optimal, plan. I wonder if people
> frequently have the same intention.
> That means you will never see clauses (and their estimates) that were
> kicked out of the plan - the ppi_clauses of a parameterised path, for
> example. And that may hide the real mechanism that caused the better
> plan to be skipped.
> Mentioning all statistics involved in the summary may reveal such info.
>
I don't quite agree with this argument. EXPLAIN is meant to describe one
particular plan, not to show why some other plans were not selected. So
I don't
think we should be adding details about estimating clauses from some
other alternative plan. We're not showing info about costing for other
indexes either, for example.
Also, if we started adding more and more statistics, that'd just be an
extra argument against the "almost obvious" claim. Not only would you
have to deduce which clauses were estimated by each object, you'd also
first have to figure out which statistics objects were used for the plan.
>>
>> The point about estimate_num_groups is good - I think there will be more
>> cases where linking the extended statistics to a clause will be hard.
>> But maybe let's not block the whole patch because of that?
> Sure, if the current code doesn't block further improvements of showing
> all statistics involved in the query planning process.
>
> I just wanted to emphasize the key idea: quite frequently we don't see
> in EXPLAIN the stuff that triggered a suboptimal plan, because it was
> overestimated and excluded: an index, a join, a clause ...
>
> Anyway, thanks for answers, I will discover the code more.
>
Thanks for the discussion.
I wonder what Yamada-san thinks about these suggestions ... He's the one
actually developing the patch, so I'd like to know his opinions.
regards
--
Tomas Vondra