Re: using extended statistics to improve join estimates

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andy Fan <zhihuifan1213(at)163(dot)com>
Subject: Re: using extended statistics to improve join estimates
Date: 2024-09-04 14:50:26
Message-ID: 7e90cd62-9fea-467a-882e-3afacf8723e5@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/9/2024 14:58, Andrei Lepikhov wrote:
> On 17/6/2024 18:10, Tomas Vondra wrote:
> x = $1 AND y = $2 AND ...
> As I see, current patch doesn't resolve this issue currently.
Let's explain my previous argument with an example (see in attachment).

The query designed to be executed with parameterised NL join:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM test t1 NATURAL JOIN test1 t2 WHERE t2.x1 < 1;

After applying the topmost patch from the patchset we can see two
different estimations (explain tuned a little bit) before and after
extended statistics:

-- before:

Nested Loop (rows=1) (actual rows=10000 loops=1)
-> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
Filter: (x1 < 1)
-> Memoize (rows=1) (actual rows=100 loops=100)
Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
-> Index Scan using test_x1_x2_x3_x4_idx on test t1
(rows=1 width=404) (actual rows=100 loops=1)
Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
(x3 = t2.x3) AND (x4 = t2.x4))

-- after:

Nested Loop (rows=10000) (actual rows=10000 loops=1)
-> Seq Scan on test1 t2 (rows=100) (actual rows=100 loops=1)
Filter: (x1 < 1)
-> Memoize (rows=1) (actual rows=100 loops=100)
Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
-> Index Scan using test_x1_x2_x3_x4_idx on test t1 (rows=1)
(actual rows=100 loops=1)
Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
(x3 = t2.x3) AND (x4 = t2.x4))

You can see, that index condition was treated as join clause and PNL
estimated correctly by an MCV on both sides.
But scan estimation is incorrect.
Moreover, sometimes we don't have MCV at all. And the next step for this
patch should be implementation of bare estimation by the only ndistinct
on each side.

What to do with the scan filter? Not sure so far, but it looks like here
may be used the logic similar to var_eq_non_const().

--
regards, Andrei Lepikhov

Attachment Content-Type Size
example.sql application/sql 2.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-09-04 15:12:48 Re: scalability bottlenecks with (many) partitions (and more)
Previous Message David Rowley 2024-09-04 14:49:51 Re: Optimize WindowAgg's use of tuplestores