Quick Links

Perfomance of IN-clause with many elements and possible solutions

From:	"dilaz03 (dot)" <dilaz03(at)gmail(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Perfomance of IN-clause with many elements and possible solutions
Date:	2017-07-23 11:35:24
Message-ID:	CALUfWrYp0YFHJ4Rijtb0z_ktYyy51N5N_jsqz0or8LmCWFu=HA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hello.

I have database with events with type from different souces identified by
id. I have query which filters events by IN-clause with many ids (1-500
ids). I see poor perfomance of IN-clause and try to investigate this
problem.

SELECT version();
version
-------------------------------------------------------------------------------------------------------------------
PostgreSQL 10beta2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

-- Full table can fit in memory
show shared_buffers;
shared_buffers
----------------
2GB

show work_mem;
work_mem
----------
16MB

SET max_parallel_workers_per_gather TO 0;
SET max_parallel_workers TO 0;

-- Create table with 10 000 000 rows with 500 bigints
CREATE TABLE ids AS SELECT trunc(random() * 500)::bigint as id from
generate_series(1, 10000000);

-- IN (...)
SELECT ('(' || string_agg(id::text, ',') || ')') AS in_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :in_clause;

Aggregate (cost=2654265.02..2654265.03 rows=1 width=8) (actual
time=17268.831..17268.831 rows=1 loops=1)
Buffers: shared hit=44248
-> Seq Scan on ids (cost=0.00..2644260.48 rows=4001815 width=0)
(actual time=0.066..16722.072 rows=3998646 loops=1)
Filter: (id = ANY
('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199}'::bigint[]))
Rows Removed by Filter: 6001354
Buffers: shared hit=44248
Planning time: 3.324 ms
Execution time: 17268.907 ms

-- IN (VALUES ...)
SELECT ('(VALUES ' || string_agg('(' || id::text || ')', ',') || ')') AS
values_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
:values_clause;
Aggregate (cost=245006.46..245006.47 rows=1 width=8) (actual
time=4086.188..4086.188 rows=1 loops=1)
Buffers: shared hit=44248
-> Hash Join (cost=7.50..235006.42 rows=4000019 width=0) (actual
time=0.978..3557.037 rows=3998646 loops=1)
Hash Cond: (ids.id = "*VALUES*".column1)
Buffers: shared hit=44248
-> Seq Scan on ids (cost=0.00..144248.48 rows=10000048 width=8)
(actual time=0.031..1138.542 rows=10000000 loops=1)
Buffers: shared hit=44248
-> Hash (cost=5.00..5.00 rows=200 width=4) (actual
time=0.923..0.923 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 16kB
-> HashAggregate (cost=3.00..5.00 rows=200 width=4)
(actual time=0.606..0.759 rows=200 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..2.50
rows=200 width=4) (actual time=0.003..0.330 rows=200 loops=1)
Planning time: 1.094 ms
Execution time: 4086.333 ms

-- '...'::hstore ? id
SELECT ('''' || string_agg(id::text || '=>NULL', ',') || '''::hstore') AS
hstore_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :hstore_clause ?
id::text;
Planning time: 0.206 ms
Execution time: 5032.794 ms

-- '...'::jsonb ? id
SELECT ('''{' || string_agg('"' || id::text || '": null', ',') ||
'}''::jsonb') AS jsonb_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :jsonb_clause ?
id::text;
Planning time: 0.114 ms
Execution time: 9277.307 ms

IN-VALUES clause has the bestest perfomance. So I have some questions:

- May be exist better solution?
- Does PostgreSQL have support of hashset structure? Extension (I don't
found)?
- IN-VALUES clause adds new node to plan. Has additional node big overhead?
How about filter by two or more IN-VALUES clause?

Thanks.

Responses

Re: Perfomance of IN-clause with many elements and possible solutions at 2017-07-23 22:40:31 from PT
Re: Perfomance of IN-clause with many elements and possible solutions at 2017-07-24 21:31:53 from David G. Johnston
Re: Perfomance of IN-clause with many elements and possible solutions at 2017-07-26 19:14:21 from Dmitry Lazurkin

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tom Lane	2017-07-23 15:48:49	Re: pg_dump not dropping event trigger
Previous Message	Greg Atkins	2017-07-23 06:59:16	Re: pg_dump not dropping event trigger