Re: Showing applied extended statistics in explain Part 2

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com, yamatattsu(at)gmail(dot)com
Cc: tatsuro(dot)yamada(at)ntt(dot)com, pgsql-hackers(at)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Showing applied extended statistics in explain Part 2
Date: 2024-07-18 21:36:11
Message-ID: 6bf0d4f2-7d4e-4f9c-8207-2ad953315a09@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/18/24 12:37, Masahiro(dot)Ikeda(at)nttdata(dot)com wrote:
>> Let me share my opinion on those questions ...
> ...>
>> For ndistinct, I think we don't show this because it doesn't go through
>> clauselist_selectivity, which is the only thing I modified in the PoC.
>> But I guess we might improve estimate_num_groups() to track the stats in a similar way,
>> I guess.
>
> Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because
> the number of clauses is one though it goes through clauselist_selectivity().
>

Ah, I see I misunderstood the original report. The query used was

EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

And it has nothing to do with the number of clauses being one neither.

The problem is this estimate is handled by examine_variable() matching
the expression to the "expression" stats, and injecting it into the
variable, so that the clauselist_selectivity() sees these stats.

This would happen even if you build just expression statistics on each
of the date_trunc() calls, and then tried a query with two clauses:

CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;

EXPLAIN SELECT * FROM t3
WHERE date_trunc('month', a) = '2020-01-01'::timestamp
AND date_trunc('day', 'a') = '2020-01-01'::timestamp;

Not sure how to handle this - we could remember when explain_variable()
injects statistics like this, I guess. But do we know that each call to
examine_variable() is for estimation? And do we know for which clause?

>
>>> ERROR: unrecognized node type: 268
>
> Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
> because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch?
> change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to
> show extra information for example "kind" of "StatisticExtInfo"?
>
> The above is just one idea came up with while I read the following comments of header
> of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy.
> * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
> * There are some subsidiary structs that are useful to copy, though.
>

I do think tracking just the OID would work, because we already know how
to copy List objects. But if we want to also track the clauses, we'd
have to keep multiple lists, right? That seems a bit inconvenient.

>
> By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice.
> I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and "Group".
> I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users understand
> when the stats are used, but I don't have any idea now.
>
> -- I tested with the example of CREATE STATISTICS documentation.
> psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
> Group (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1)
> Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> -> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1)
> Workers Planned: 1
> Workers Launched: 1
> -> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2)
> Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> Sort Method: quicksort Memory: 32kB
> Worker 0: Sort Method: quicksort Memory: 32kB
> -> Partial HashAggregate (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768 rows=249 loops=2)
> Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
> Batches: 1 Memory Usage: 45kB
> Worker 0: Batches: 1 Memory Usage: 45kB
> -> Parallel Seq Scan on t3 (cost=0.00..6963.66 rows=309177 width=16) (actual time=0.021..171.214 rows=262800 loops=2)
> Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here
> Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here
> Planning Time: 114327.206 ms
> Execution Time: 288.007 ms
> (18 rows)
>

I haven't looked into this, but my guess would be this is somehow
related to the parallelism - there's one parallel worker, which means we
have 2 processes to report stats for (leader + worker). And you get two
copies of the "Ext Stats" line. Could be a coincidence, ofc, but maybe
there's a loop to print some worker info, and you print the statistics
info in it?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-07-18 21:48:24 Re: add function argument names to regex* functions.
Previous Message John H 2024-07-18 21:22:08 Re: Allow logical failover slots to wait on synchronous replication