Re: alternate idioms for large "IN (...)" lists

From: russm <russm(at)icorp(dot)com(dot)au>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: alternate idioms for large "IN (...)" lists
Date: 2002-06-03 06:45:36
Message-ID: B9214DB0.41B0%russm@icorp.com.au
Lists: pgsql-sql

I've been playing with other ways of structuring the query to try and make
the planner use a hash join, and am currently looking at replacing

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( 564,3789, ... ,5240)

with

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map m, (
SELECT 564 AS a UNION
SELECT 3789 AS a UNION
...
SELECT 5240 AS a
) a
WHERE m.asset_id = a.a

which produces a hash join between attribute_map and the list of asset_id
values, and runs in roughly 7% of the time of the "IN (...)" version of the
query. I'll need to see how it scales to our full dataset, but this may be
enough of an improvement to get us by. The downside is that the queries I'm
passing to the DB are bloated by a factor of 5 - I suspect I can live with this.
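In case it's useful to anyone else, the UNION-subselect form is easy to
generate at the application layer. Here's a minimal sketch (the table and
column names match the query above, but sqlite3 stands in for postgres just so
the snippet is self-contained; I've used UNION ALL since the id list has no
duplicates, which skips the duplicate-elimination step a plain UNION pays for):

```python
import sqlite3

def union_subselect_query(ids):
    """Build the UNION-subselect form of the query for a list of asset ids."""
    # %d keeps the interpolated values strictly integer, so a stray string
    # in the id list fails loudly instead of being spliced into the SQL
    union = "\nUNION ALL ".join("SELECT %d AS a" % i for i in ids)
    return (
        "SELECT attribute_id, COUNT(m.asset_id)\n"
        "FROM attribute_map m, (\n" + union + "\n) a\n"
        "WHERE m.asset_id = a.a\n"
        "GROUP BY attribute_id"
    )

# smoke test against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
conn.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                 [(564, 1), (564, 2), (3789, 1), (5240, 3), (99, 1)])
counts = dict(conn.execute(union_subselect_query([564, 3789, 5240])))
# counts maps attribute_id -> COUNT: {1: 2, 2: 1, 3: 1}
```

The factor-of-5 query bloat comes entirely from the repeated "UNION ALL
SELECT" boilerplate, which is why it grows linearly with the id list.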

Christopher suggested I rewrite the query in terms of EXISTS, but I can't
figure out how to do that in a way that improves the performance of this
query - am I missing something here?

Just out of curiosity, why does postgresql's IN perform so poorly? Is it
because IN is implemented in terms of ANY? I would expect that the best
approach for an IN expression is a hash join, but that only works for = and
!=, and not any of the other operators that ANY supports.
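For what it's worth, the hash-join intuition can be sketched in a few lines:
build a hash table of the id list once, then probe each attribute_map row
against it, instead of doing one index lookup per listed value. This is only
a toy model of the idea, not how the executor actually implements it:

```python
from collections import Counter

def hash_semijoin_counts(rows, wanted_ids):
    """Count attribute_ids over rows whose asset_id is in wanted_ids.

    rows is an iterable of (asset_id, attribute_id) pairs. Building the
    set once makes each membership probe O(1) on average - the same idea
    a hash join exploits, and why it only works for the = operator:
    hashing can answer "equal?" but not "less than?".
    """
    wanted = set(wanted_ids)  # one-time hash build over the id list
    return Counter(attr for asset, attr in rows if asset in wanted)

sample = [(564, 1), (564, 2), (3789, 1), (5240, 3), (99, 1)]
hash_semijoin_counts(sample, [564, 3789, 5240])  # Counter({1: 2, 2: 1, 3: 1})
```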

cheers

Russell

On 3/6/02 4:01 AM, "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
wrote:

> Yep, Postgresql's IN implementation is very slow. You are far better off
> rewriting your query to use EXISTS or an OUTER JOIN.
>
> See:
> http://www3.us.postgresql.org/users-lounge/docs/7.2/postgres/functions-subquery.html
>
> eg.
>
> SELECT attribute_id, COUNT(asset_id)
> FROM attribute_map
> WHERE EXISTS (SELECT asset_id FROM <potentially several thousand asset_id
> values> WHERE asset_id=...)
> GROUP BY attribute_id
> HAVING COUNT(asset_id) < <cutoff value>
> ORDER BY count DESC
> LIMIT <limit>
>
> or something like:
>
> SELECT attribute_id, COUNT(asset_id)
> FROM attribute_map atm RIGHT JOIN assets_map aam
> ON atm.asset_id=aam.asset_id
> GROUP BY....blah blah blah...
>
> Play around with these alternative syntaxes and stay away from putting
> thousands of rows in IN() and you will be able to make your query run orders
> of magnitude faster.
>
> Chris
>
> ----- Original Message -----
> From: "russm" <russm(at)icorp(dot)com(dot)au>
> To: <pgsql-sql(at)postgresql(dot)org>
> Sent: Saturday, June 01, 2002 11:10 PM
> Subject: [SQL] alternate idioms for large "IN (...)" lists
>
>
>> I'm working on an application (originally written for Oracle) that is being
>> brought to its knees by one commonly-executed query - finding the most
>> common attributes for a given list of assets...
>>
>>
>> SELECT attribute_id, COUNT(asset_id)
>> FROM attribute_map
>> WHERE asset_id IN ( <potentially several thousand asset_id values> )
>> GROUP BY attribute_id
>> HAVING COUNT(asset_id) < <cutoff value>
>> ORDER BY count DESC
>> LIMIT <limit>
>>
>>
>> where attribute_map is a many-to-many map of asset_id to attribute_id -
>>
>> demo=# \d attribute_map
>>       Table "attribute_map"
>>     Column    |  Type   | Modifiers
>> --------------+---------+-----------
>>  asset_id     | integer |
>>  attribute_id | integer |
>> Indexes: am_asset_id_idx,
>>          am_attribute_id_idx
>> Unique keys: am_asset_id_attribute_id_un
>> Triggers: RI_ConstraintTrigger_26912,
>>           RI_ConstraintTrigger_26906
>>
>>
>> From what I've read here, postgresql doesn't handle IN (...) queries
>> especially efficiently, and it looks to me like I'm being bitten by that. An
>> EXPLAIN ANALYZE on that query with just under 40k rows in attribute_map and
>> 667 asset_id values in the IN (...) list returns
>>
>> NOTICE: QUERY PLAN:
>>
>> Limit (cost=64361.51..64361.51 rows=200 width=8) (actual
>> time=24431.51..24431.80 rows=200 loops=1)
>> -> Sort (cost=64361.51..64361.51 rows=1647 width=8) (actual
>> time=24431.50..24431.61 rows=201 loops=1)
>> -> Aggregate (cost=0.00..64273.49 rows=1647 width=8) (actual
>> time=55.29..24324.75 rows=1747 loops=1)
>> -> Group (cost=0.00..64191.13 rows=16473 width=8) (actual
>> time=2.38..24198.18 rows=21308 loops=1)
>> -> Index Scan using am_attribute_id_idx on
>> attribute_map (cost=0.00..64149.94 rows=16473 width=8) (actual
>> time=2.37..24034.97 rows=21308 loops=1)
>> Total runtime: 24433.20 msec
>>
>>
>> which I'm assuming means that the backend is doing a separate index lookup
>> for each of the 667 asset_id values in the list.
>>
>> Are there any standard idioms in postgres for getting around the poor
>> handling of IN (...) lists? Placing the list of asset_id values in a table
>> and then joining against that runs in about 1/15 the time (postgres makes a
>> hash of the asset_id values and then hash joins against the attribute_map)
>> but since my asset_id values come from outside the database that approach
>> would require creating and managing a whole lot of temporary tables, which
>> I'd rather avoid.
>>
>> Much obliged for any suggestions
>>
>> Russell
>>
>>
>>
>>
>
>
