Re: PostgreSQL NOT IN performance

From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PostgreSQL NOT IN performance
Date: 2008-11-19 15:12:43
Message-ID: 331e40660811190712p416c371t9c9237ac98ab06fb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

2008/11/19 DANIEL CRISTIAN CRUZ <daniel(dot)cruz(at)sc(dot)senai(dot)br>

> Something weird with your example which doesn't have the same result, see
> row count with explain analyze:
>
My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be
used only when either operation is done on unique key on t1 or result is
going to be made unique.

> cruz=# SELECT version();
> version
> --------------------------------------------------------------------------------------------
> PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2
> (1 registro)
>
> cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
> Seq Scan on t1 (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1)
> Filter: (NOT (hashed subplan))
> SubPlan
> -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)
> Total runtime: 625.471 ms
> (5 registros)
>
> cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------
> SetOp Except All (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1)
> -> Sort (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1)
> Sort Key: "*SELECT* 1".id
> Sort Method: external merge Disk: 5864kB
> -> Append (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1)
> -> Subquery Scan "*SELECT* 1" (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
> -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
> -> Subquery Scan "*SELECT* 2" (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
> -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
> Total runtime: 3790.831 ms
> (10 registros)
>
> Sometimes I got a better result (on older versions) with this kind of
> query, but in this case it doesn't:
>
> cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
> Merge Right Join (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1)
> Merge Cond: (t2.id = t1.id)
> Filter: (t2.id IS NULL)
> -> Sort (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1)
> Sort Key: t2.id
> Sort Method: quicksort Memory: 4392kB
> -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1)
> -> Sort (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1)
> Sort Key: t1.id
> Sort Method: quicksort Memory: 8784kB
> -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1)
> Total runtime: 1717.062 ms
> (12 registros)
>
>
Yes, your method is even better on 8.3.3 I have. I will try to update to
8.3.5 to see if there was optimizer improvements. You could try increasing
values, say, by 10 in table filling to see if NOT IT will switch to "slow"
version (for me it starts being slow from some magic row count in t2). I
suppose it is the moment it switches from "hashed subplan" to "subplan". For
me for 10000 values it is "hashed subplan" (and it is momentary fast), for
100000 - it is "subplan" and it is sloow.
BTW: Which (memory?) configuration variable can affect such a switch?

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Kranti™ K K Parisa 2008-11-19 15:54:13 Very Urgent : Sequences Problem
Previous Message Віталій Тимчишин 2008-11-19 14:55:18 Re: PostgreSQL NOT IN performance