Re: Showing applied extended statistics in explain Part 2

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Tatsuro Yamada <yamatattsu(at)gmail(dot)com>, Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: tomas(dot)vondra(at)enterprisedb(dot)com, tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2025-02-10 09:09:53
Message-ID: c6668d68-dc2b-4093-a912-d5412672d678@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/2/2025 20:50, Tomas Vondra wrote:
>
>
> On 1/24/25 11:17, Andrei Lepikhov wrote:
>> On 11/1/24 12:22, Tatsuro Yamada wrote:
>> I often use SQL Server to compare execution plans generated by
>> PostgreSQL, and I appreciate how they display the usage of extended
>> statistics. They clearly identify which statistics were utilized during
>> query planning and include them in the summary section of query plan.
>>
> I admit not knowing what exactly SQL Server shows in the explain, but it
> seems helpful to know which part of the plan used which statistics, no?
I wonder if you meant clauses, not a part of the plan. Clauses may be
used on different levels of the join tree, but they are estimated once.
I meant we may just refer to the statistic used by its name in the
node's explanation if this node contains estimated clause. I use
multiple extended statistics. Sometimes, they intersect, and it is hard
to say which one was used for a specific clause.
> Imagine we only knew an index was used, but not which node used it and
> for what keys. That would be a bit ... useless.
Sure, but I am suspicious about the strong necessity to show a specific
clause estimated by the statistic. It is almost obvious because of the
simple choosing algorithm.
>
> Or what info does the SQL server include in the plan, exactly? Can you
> share an example?
I wouldn't say SQL Server designed it ideally. In an XML file, you can
find something like this:

<StatisticsInfo LastUpdate="2024-09-09T21:55:04.43"
ModificationCount="0" SamplingPercent="17.9892"
Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]"
Schema="[dbo]" Database="[DB]"></StatisticsInfo>

The main profit here - you see all the stats involved in estimations
(and their state), even if final plan doesn't contain estimated stuff at
all.
>
>> I find this method much easier to implement, as it allows us to see any
>> usage points - remember that `estimate_num_groups` may be applied in
>> multiple places and may not always correspond to a node clause.
>>
>
> I may be wrong, but I don't quite see why would that be significantly
> easier to implement. You still need to track the information somewhere,
> and it probably needs for individual Path nodes. Because while building
> the plan you don't know which paths will get selected.
In my mind, gathering extended statistics usage data may be implemented
like it is already done for many parameters in the stat collector. It
may help identify whether statistics are used frequently or not.
>
> Although, maybe the selectivity estimation is sufficiently independent
> from the exact paths? In which case we might track it at the plan level.
Not only that argument. Usually, when I request a DBMS about an EXPLAIN,
my typical desire is to realise which plan we have and why the optimiser
has skipped another, maybe even more optimal, plan. I wonder if people
frequently have the same intention.
That means you will never see clauses (and their estimations) that were
kicked off the plan - ppi_clauses of a parameterised path, for example.
And it may hide the real mechanic that caused skipping the better plan.
Mentioning all statistics involved in the summary may reveal such info.

>
> The point about estimate_num_groups is good - I think there will be more
> cases where linking the extended statistics to a clause will be hard.
> But maybe let's not block the whole patch because of that?
Sure, if the current code doesn't block further improvements of showing
all statistics involved in the query planning process.

I just wanted to emphasize the key idea: quite frequently we don't see
in explain stuff that triggered suboptimal plan, because it was
overestimated and excluded: an index, join, clause ... .

Anyway, thanks for answers, I will discover the code more.

--
regards, Andrei Lepikhov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2025-02-10 09:15:22 Re: Conflict detection for update_deleted in logical replication
Previous Message Richard Guo 2025-02-10 08:45:38 Re: Fix outdated code comments in nodeAgg.c