Quick Links

Incorrect choice of Nested Loop for a skewed distribution

From:	Oleg Kharin <ok(at)uvadrev(dot)ru>
To:	pgsql-performance(at)lists(dot)postgresql(dot)org
Subject:	Incorrect choice of Nested Loop for a skewed distribution
Date:	2019-09-03 17:47:42
Message-ID:	9df40087-f396-143d-5e4a-da3c97b5e536@uvadrev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Hi All,

With PostgreSQL 10 and 11, the planner doesn't use the lists of most
common values to determine the selectivity of "=" for Nested Loop as it
does for a normal inner join in eqjoinsel_inner(). Incorrect choice of a
nested loops join strategy causes poor query performance.
To demonstrate it one can make the following test case:

create table t(f integer not null,g integer not null);
create table u(f integer not null,g integer not null);
create sequence s cache 1000;
insert into t select 0,s from (select nextval('s') as s) as d;
insert into t select 0,s from (select nextval('s') as s) as d;
insert into t select 0,s from (select nextval('s') as s from t,t t1,t
t2) as d;
insert into t select 0,s from (select nextval('s') as s from t,t t1,t
t2,t t3) as d;
insert into t(f,g) select g,f from t;
insert into u select * from t;
create index t_f on t(f);
vacuum analyze;

The columns f and g of both tables t and u have a skewed distribution:
10010 values of 0 and 10010 unique values starting from 1.
Let's see query plan for the join of t and u:

explain analyze select * from t,u where t.f=u.f and t.g=u.g;

                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.29..7629.95 rows=25055030 width=16) (actual
time=0.042..22540.629 rows=20020 loops=1)
   -> Seq Scan on u (cost=0.00..289.20 rows=20020 width=8) (actual
time=0.011..3.025 rows=20020 loops=1)
   -> Index Scan using t_f on t (cost=0.29..0.36 rows=1 width=8)
(actual time=0.565..1.125 rows=1 loops=20020)
         Index Cond: (f = u.f)
         Filter: (u.g = g)
         Rows Removed by Filter: 5004
Planning Time: 0.394 ms
Execution Time: 22542.639 ms

After dropping the index
drop index t_f;
we obtain much better query plan (without Nested Loop):

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Merge Join (cost=3439.09..442052.26 rows=25055030 width=16) (actual
time=15.708..32.735 rows=20020 loops=1)
   Merge Cond: ((t.f = u.f) AND (t.g = u.g))
   -> Sort (cost=1719.54..1769.59 rows=20020 width=8) (actual
time=8.189..10.189 rows=20020 loops=1)
         Sort Key: t.f, t.g
         Sort Method: quicksort Memory: 1707kB
         -> Seq Scan on t (cost=0.00..289.20 rows=20020 width=8)
(actual time=0.012..2.958 rows=20020 loops=1)
   -> Sort (cost=1719.54..1769.59 rows=20020 width=8) (actual
time=7.510..9.459 rows=20020 loops=1)
         Sort Key: u.f, u.g
         Sort Method: quicksort Memory: 1707kB
         -> Seq Scan on u (cost=0.00..289.20 rows=20020 width=8)
(actual time=0.008..2.748 rows=20020 loops=1)
Planning Time: 0.239 ms
Execution Time: 34.585 ms

Using MCV lists in var_eq_non_const() would prevent from choosing Nested
Loop in such cases.

Regards,
Oleg Kharin

Responses

Re: Incorrect choice of Nested Loop for a skewed distribution at 2019-09-05 00:18:33 from Justin Pryzby

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Fredrik Blomqvist	2019-09-04 17:29:45	Upsert performance considerations (~1 mil/hour)
Previous Message	Merlin Moncure	2019-09-03 17:34:26	Re: UPGRADE TO PG11 CAUSED DEGREDATION IN PERFORMANCE