From: | Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru> |
---|---|
To: | Andy Fan <zhihuifan1213(at)163(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: using extended statistics to improve join estimates |
Date: | 2024-05-21 06:46:08 |
Message-ID: | 1a741830-f870-4907-80e1-c8455c8fce1d@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 5/20/24 16:40, Andrei Lepikhov wrote:
> On 20/5/2024 15:52, Andy Fan wrote:
>> + if (clauselist_selectivity_hook)
>> + *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - library may estimate not all the clauses - it is a reason,
> why I added input/output parameter 'estimatedclauses' by analogy with
> statext_clauselist_selectivity.
Here is a polished and a bit modified version of the hook proposed.
Additionally, I propose exporting the statext_mcv_clauselist_selectivity
routine, likewise dependencies_clauselist_selectivity. This could
potentially enhance the functionality of the PostgreSQL estimation code.
To clarify the purpose, I want an optional, loaded as a library, more
conservative estimation based on distinct statistics. Let's provide (a
bit degenerate) example:
CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;
I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)
Dependency works great if it is the same for all the data in the
columns. But we get underestimations if we have different laws for
subsets of rows. So, if we don't have MCV statistics, sometimes we need
to pass over dependency statistics and use ndistinct instead.
--
regards,
Andrei Lepikhov
Postgres Professional
Attachment | Content-Type | Size |
---|---|---|
clauselist_selectivity_hook.diff | text/x-patch | 3.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Smith | 2024-05-21 06:52:33 | Re: Pgoutput not capturing the generated columns |
Previous Message | Pavel Stehule | 2024-05-21 05:48:27 | Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month) |