Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Steven Grimm <sgrimm(at)thesegovia(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join
Date: 2015-11-14 08:32:19
Message-ID: CAKJS1f8baiZHadAcjSvDnccOAhOgtFpMo=1h6YxLafsp97KS3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 14 November 2015 at 19:25, Steven Grimm <sgrimm(at)thesegovia(dot)com> wrote:

>
> Execution plan for the IN version followed by the = version (for just one
> of the IDs):
>
> ---------------------------------------------
> Nested Loop (cost=5.39..8107.18 rows=285 width=4) (actual
> time=1.230..6456.567 rows=4499 loops=1)
> Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2,
> mid.id3]))
> Rows Removed by Join Filter: 22495501
> -> Seq Scan on multi_id mid (cost=0.00..78.00 rows=5000 width=12)
> (actual time=0.010..1.385 rows=5000 loops=1)
> -> Materialize (cost=5.39..310.66 rows=95 width=4) (actual
> time=0.000..0.263 rows=4500 loops=5000)
> -> Bitmap Heap Scan on settings (cost=5.39..310.19 rows=95
> width=4) (actual time=1.207..3.210 rows=4500 loops=1)
> Recheck Cond: ((setting_id = 1) AND (setting_value =
> 'common_1'::text))
> Heap Blocks: exact=1405
> -> Bitmap Index Scan on settings_idx_setting_id
> (cost=0.00..5.37 rows=95 width=0) (actual time=0.930..0.930 rows=4500
> loops=1)
> Index Cond: ((setting_id = 1) AND (setting_value =
> 'common_1'::text))
> Planning time: 0.178 ms
> Execution time: 6456.897 ms
>
>
> Hash Join (cost=145.98..472.93 rows=103 width=4) (actual
> time=2.677..6.890 rows=4500 loops=1)
> Hash Cond: (settings.owner_id = mid.id1)
> -> Bitmap Heap Scan on settings (cost=5.48..330.50 rows=103 width=4)
> (actual time=1.194..3.477 rows=4500 loops=1)
> Recheck Cond: ((setting_id = 1) AND (setting_value =
> 'common_1'::text))
> Heap Blocks: exact=1405
> -> Bitmap Index Scan on settings_idx_setting_id
> (cost=0.00..5.45 rows=103 width=0) (actual time=0.854..0.854 rows=4500
> loops=1)
> Index Cond: ((setting_id = 1) AND (setting_value =
> 'common_1'::text))
> -> Hash (cost=78.00..78.00 rows=5000 width=4) (actual
> time=1.463..1.463 rows=5000 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 176kB
> -> Seq Scan on multi_id mid (cost=0.00..78.00 rows=5000
> width=4) (actual time=0.007..0.717 rows=5000 loops=1)
> Planning time: 0.311 ms
> Execution time: 7.166 ms
> ---------------------------------------------
>
> What am I doing wrong in the IN version of the query, if anything?
>
>
In order to allow hash or merge joins, there needs to be at least 1 common
equality expression in the join condition. Your IN query does not have this.

Notice that with your IN() query that the IN() clause is evaluated in the
nested loop's join condition:

Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2, mid.id3]))

If you think of what a nested loop is, basically for each outer row it must
scan each inner row until it finds a match. Since this is a SEMI join, it
can stop once the first match is found, and then move on to the next outer
row. It then goes and does all that again with the next outer row until
there's no outer rows left. This is why the IN version is so slow. In you
case 22495501 + 4499 comparisons of that join condition have been made,
which is probably not too unreasonable for 6.5 seconds.

The problem is that the optimizer is unable to use hash join or merge joins
when you have the IN() condition as the join condition, the reason for this
is that you're effectively saying to join on any of 3 conditions:
settings.owner_id = mid.id1 OR settings.owner_id = mid.id2 OR
settings.owner_id = mid.id3. If you think how a hash join works then 1 hash
table is no good here, as you effectively have 3 possible keys for the hash
table, the executor would have to build 3 tables to make that possible, but
we only ever build 1 in PostgreSQL. As you may know, a hash table is a very
efficient data structure for key value lookups, but there can only be a
single key. Merge join has the same problem because it's only possible to
have a single sort order.

You might think that all this sounds terrible, but the main reason that
you've got this issue is down to the design of the multi_id table. If you
just had a single ID and perhaps another column to store the 1,2,3 (the
need for this is not quite clear to me), also perhaps your PK to reference
some other table, you could just write the query as:

select id from multi_id where id in(select owner_id from settings
where setting_id
= 1 and setting_value = 'common_1');

Quite possibly this would produce a plan with a hash join and execute very
quickly.

With your current schema the only option that I can see is to build a query
such as:

select id1 from multi_id where id1 in (select owner_id from settings
where setting_id
= 1 and setting_value = 'common_1')
union
select id2 from multi_id where id2 in (select owner_id from settings
where setting_id
= 1 and setting_value = 'common_1')
union
select id3 from multi_id where id3 in (select owner_id from settings
where setting_id
= 1 and setting_value = 'common_1')

but then you're suffering from having to perhaps build the same hash table
3 times, and also the overhead of the UNION getting rid of duplicate rows.
You could perhaps make it slightly better with a common table expression
such as:

with cte (select owner_id from settings where setting_id = 1 and setting_value
= 'common_1') as
select id1 from multi_id where id1 in (select owner_id from cte)
union
select id2 from multi_id where id2 in (select owner_id from cte)
union
select id3 from multi_id where id3 in (select owner_id from cte);

but you still have the union overhead.

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thomas Kellerer 2015-11-14 09:30:16 Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join
Previous Message Steven Grimm 2015-11-14 07:11:26 Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join