Quick Links

Re: Optimization idea for long IN() lists

From:	Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Optimization idea for long IN() lists
Date:	2014-08-09 04:22:00
Message-ID:	CAK-MWwRR80afBo6UHk201J+Q+gz8Wxg9kjE0-PFMtP3j15oZOQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On Sat, Aug 9, 2014 at 5:15 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> Folks,
>
> So one thing we tell users who have chronically long IN() lists is that
> they should create a temporary table and join against that instead.
> Other than not having the code, is there a reason why PostgreSQL
> shouldn't do something like this behind the scenes, automatically?
>
>
Hi Josh,

I know that problem for many years.
There are some workaround which doesn't require using the temporary tables
(and I used that approach quite a lot when performance matter):

Instead of using:
SELECT * FROM sometable
WHERE
somefield IN (val1, val2, ...)
AND other_filters;

Query could be written as:
SELECT * FROM sometable
JOIN (VALUES ((val1), (val2) ...)) AS v(somefield) ON
v.somefield=sometable.somefield
WHERE
other_filters;

When there no index on somefield query plans would look like as:

Original query:

Filter: (somefield = ANY ('{...}'::integer[]))

vs optimized query:

Hash Join (cost=0.25..117.89 rows=22 width=59) (actual time=5.332..5.332
rows=0 loops=1)
Hash Cond: (sometable.somefield = "*VALUES*".somefield)
...
-> Hash (cost=0.12..0.12 rows=10 width=4) (actual time=0.010..0.010
rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Values Scan on "*VALUES*" (cost=0.00..0.12 rows=10 width=4)
(actual time=0.001..0.003 rows=10 loops=1)

In synthetic data I observed the following performance results (fully
in-memory data with integer values):

List length IN Performance JOIN VALUES Performance
10 5.39ms 5.38ms
100 9.74ms 5.49ms
1000 53.02ms 9.89ms
10000 231.10ms 13.14ms

So starting from 10 elements VALUES/HASH JOIN approach is clear winner.
In case of the text literals IN list performance difference even more
obvious (~2 order of magnitude for 10000 list).

However, if IN list used for the primary key lookup - there are no visible
performance difference between these two approaches.

So yes there are some space for optimization of "Filter: (somefield = ANY
('{...}'::integer[]))" via hashing.

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim(dot)boguk(at)gmail(dot)com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."

In response to

Optimization idea for long IN() lists at 2014-08-08 19:15:36 from Josh Berkus

Browse pgsql-performance by date

	From	Date	Subject
Next Message	楊新波	2014-08-12 02:59:02	how does the planer to estimate row when i use order by and group by
Previous Message	Josh Berkus	2014-08-08 19:15:36	Optimization idea for long IN() lists