WHERE IN (subselect) versus WHERE IN (1,2,3,)

From: Kevin Goess <kgoess(at)bepress(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: WHERE IN (subselect) versus WHERE IN (1,2,3,)
Date: 2012-03-15 09:47:03
Message-ID: CABZkbxg3JEi-KAzwgRr34EuyqmKKRTM0H5AyEvqeBaAYekpMbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

My apologies, I'm sure this question has been asked before but I couldn't
find anything on the list that meant anything to me.

We have a table "contexts" with 1.6 million rows, and a table "articles"
with 1.4 million rows, where an "article" is a particular kind of
"context". We want to select from a join on those two tables like this

SELECT COUNT(*)
FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
WHERE contexts.context_key IN (...);
/* and some combination of columns from articles and contexts */

If "IN(...)" is a query, then this guy does a seq scan on the contexts
table, even if the subquery is "select col_a from kgtest" where kgtest has
one row. If however I read the ids beforehand and write them into the
query, a la "IN (111,222,333...)", then the everything is happy, up to at
least 20,000 values written into the sql, at which point smaller machines
will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get
the distinct impression that I'm Doing It Wrong. Is this expected
behavior? It seems surprising to me.

To demonstrate:

/* nothing up my sleeve */
# select * from kgtest;
cola
---------
1652729
(1 row)

/* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(1652729);
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189
rows=1 loops=1)
Buffers: shared hit=7
-> Nested Loop (cost=0.00..3.81 rows=1 width=0) (actual
time=0.181..0.181 rows=0 loops=1)
Buffers: shared hit=7
-> Index Scan using contexts_pkey on contexts (cost=0.00..1.90
rows=1 width=4) (actual time=0.109..0.112 ro
Index Cond: (context_key = 1652729)
Buffers: shared hit=4
-> Index Scan using articles_pkey on articles (cost=0.00..1.90
rows=1 width=4) (actual time=0.060..0.060 ro
Index Cond: (articles.context_key = 1652729)
Buffers: shared hit=3
Total runtime: 0.324 ms
(11 rows)

/* subselect, query plan does seq scan on contexts */

# explain (analyze, buffers) select count(*)from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(select cola from kgtest);
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118505.72..118505.73 rows=1 width=0) (actual
time=0.274..0.275 rows=1 loops=1)
Buffers: shared hit=5
-> Hash Join (cost=12512.61..116661.91 rows=737524 width=0) (actual
time=0.269..0.269 rows=0 loops=1)
Hash Cond: (contexts.context_key = articles.context_key)
Buffers: shared hit=5
-> Seq Scan on contexts (cost=0.00..64533.03 rows=1648203
width=4) (actual time=0.009..0.009 rows=1 loops=1
Buffers: shared hit=1
-> Hash (cost=412.56..412.56 rows=737524 width=8) (actual
time=0.110..0.110 rows=0 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 0kB
Buffers: shared hit=4
-> Nested Loop (cost=40.00..412.56 rows=737524 width=8)
(actual time=0.107..0.107 rows=0 loops=1)
Buffers: shared hit=4
-> HashAggregate (cost=40.00..42.00 rows=200
width=4) (actual time=0.069..0.071 rows=1 loops=1)
Buffers: shared hit=1
-> Seq Scan on kgtest (cost=0.00..34.00
rows=2400 width=4) (actual time=0.048..0.050 rows
Buffers: shared hit=1
-> Index Scan using articles_pkey on articles
(cost=0.00..1.84 rows=1 width=4) (actual time=0.0
Index Cond: (articles.context_key = kgtest.cola)
Buffers: shared hit=3
Total runtime: 0.442 ms

--
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgoess(at)bepress(dot)com

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Richard Harley 2012-03-15 12:15:58 Backups
Previous Message Martin Gregorie 2012-03-15 08:24:59 Re: copy in date string "00-00-00 00:00:00"