alternate idioms for large "IN (...)" lists

From: russm <russm(at)icorp(dot)com(dot)au>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: alternate idioms for large "IN (...)" lists
Date: 2002-06-02 06:10:02
Message-ID: B91FF3DA.40C6%russm@icorp.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I'm working on an application (originally written for Oracle) that is being
brought to it's knees by one commonly-executed query - finding the most
common attributes for a given list of assets...

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( <potentially several thousand asset_id values> )
GROUP BY attribute_id
HAVING COUNT(asset_id) < <cutoff value>
ORDER BY count DESC
LIMIT <limit>

where attribute_map is a many-to-many map of asset_id to attribute_id -

demo=# \d attribute_map
Table "attribute_map"
Column | Type | Modifiers
--------------+---------+-----------
asset_id | integer |
attribute_id | integer |
Indexes: am_asset_id_idx,
am_attribute_id_idx
Unique keys: am_asset_id_attribute_id_un
Triggers: RI_ConstraintTrigger_26912,
RI_ConstraintTrigger_26906

From what i've read here, postgresql doesn't handle IN (...) queries
especially efficiently, and it looks to me like I'm being bitten by that. An
EXPLAIN ANALYZE on that query with just under 40k rows in attribute_map and
667 asset_id values in the IN (...) list returns

NOTICE: QUERY PLAN:

Limit (cost=64361.51..64361.51 rows=200 width=8) (actual
time=24431.51..24431.80 rows=200 loops=1)
-> Sort (cost=64361.51..64361.51 rows=1647 width=8) (actual
time=24431.50..24431.61 rows=201 loops=1)
-> Aggregate (cost=0.00..64273.49 rows=1647 width=8) (actual
time=55.29..24324.75 rows=1747 loops=1)
-> Group (cost=0.00..64191.13 rows=16473 width=8) (actual
time=2.38..24198.18 rows=21308 loops=1)
-> Index Scan using am_attribute_id_idx on
attribute_map (cost=0.00..64149.94 rows=16473 width=8) (actual
time=2.37..24034.97 rows=21308 loops=1)
Total runtime: 24433.20 msec

which I'm assuming means that the backend is doing a seperate index lookup
for each of the 667 asset_id values in the list.

Are there any standard idioms in postgres for getting around the poor
handling of IN (...) lists? Placing the list of asset_id values in a table
and then joining against that runs in about 1/15 the time (postgres makes a
hash of the asset_id values and then hash joins against the attribute_map)
but since my asset_id values come from outside the database that approach
would require creating and managing a whole lot of temporary tables, which
i'd rather avoid.

Much obliged for any suggestions

Russell

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Christopher Kings-Lynne 2002-06-02 18:01:27 Re: alternate idioms for large "IN (...)" lists
Previous Message D'Arcy J.M. Cain 2002-06-01 10:35:10 Re: problem with sql