Optimizing NOT IN plans / verify rewrite

From: Maciek Sakrejda <msakrejda(at)truviso(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Optimizing NOT IN plans / verify rewrite
Date: 2010-08-02 19:12:51
Message-ID: AANLkTim0entqazUpWhwotC+gcDGrmeM=a=WPJUv=MgdO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
subqueries:

DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM bar
cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type != 'o');

The plan produced for this is:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using foo_type_index on foo (cost=17851.93..1271633830.75
rows=66410 width=6)
Index Cond: (type = 'o'::bpchar)
Filter: ((NOT (subplan)) AND (NOT (subplan)))
SubPlan
-> Materialize (cost=6077.87..10238.57 rows=299170 width=8)
-> Seq Scan on bar cqc (cost=0.00..4609.70 rows=299170 width=8)
-> Materialize (cost=11774.06..15728.45 rows=284339 width=8)
-> Seq Scan on foo car (cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
(9 rows)

Unfortunately, when these tables get large-ish, the materilzations get
really expensive to re-scan for every tuple (see cost above). At the
moment, I have ~500k rows in foo and ~300k rows in bar. The
selectivity of type = 'o' is ~50%. I've tried to re-write the query as
follows:

DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as b FROM
(SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM foo
cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND cqar2.type
!= 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER JOIN bar ON
(candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS NULL AND
bar.b IS NULL);

This gives the more sensible plan:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Hash IN Join (cost=48999.81..71174.41 rows=66410 width=6)
Hash Cond: (foo.b = cqar1.b)
-> Seq Scan on foo (cost=0.00..9003.78 rows=549978 width=14)
-> Hash (cost=47909.68..47909.68 rows=66410 width=8)
-> Hash Left Join (cost=24562.29..47909.68 rows=66410 width=8)
Hash Cond: (cqar1.b = bar.b)
Filter: (bar.b IS NULL)
-> Hash Left Join (cost=15043.96..33635.58 rows=132820 width=8)
Hash Cond: (cqar1.b = cqar2.b)
Filter: (cqar2.b IS NULL)
-> Seq Scan on foo cqar1 (cost=0.00..10378.73
rows=265639 width=8)
Filter: (type = 'o'::bpchar)
-> Hash (cost=10378.73..10378.73 rows=284339 width=8)
-> Seq Scan on foo cqar2
(cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
-> Hash (cost=4609.70..4609.70 rows=299170 width=8)
-> Seq Scan on bar (cost=0.00..4609.70
rows=299170 width=8)
(17 rows)

As far as I can tell, the results are identical.

My questions

1. Is my rewrite valid?
2. Any way to reliably achieve the second plan (or really, any plan
that doesn't rescan ~~500k tuples per each of ~250k tuples) by
tweaking (per-session) planner constants? I've played with this a
little, but without much success. As with any rewrite situation, I'd
prefer to stick with the simpler, more explicit original query.

Here is a SQL script to reproduce the problem:

\set ON_ERROR_STOP

drop schema if exists not_in_test cascade;
create schema not_in_test;

set search_path to not_in_test;

create table foo (
a oid not null,
b bigint not null,
type char not null,
ts timestamp without time zone not null
);
create index "foo_b_a_type_index" on foo (b, a, type);
create index "foo_a_index" on foo (a);
create index "foo_type_index" on foo(type);

create table bar (
b bigint unique not null,
c timestamp with time zone not null
);
create index "bar_b_index" on bar(b);

insert into foo select (random()*10)::integer,
generate_series(1,550000), case when random() > 0.5 then 'o' else 'x'
end, now();
insert into bar select val, now() from generate_series(1,1200000) as
vals(val) where random() > 0.75;

analyze foo;
analyze bar;

EXPLAIN DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b
FROM bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type
!= 'o');
EXPLAIN DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as
b FROM (SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM
foo cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND
cqar2.type != 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER
JOIN bar ON (candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS
NULL AND bar.b IS NULL);

Thanks,
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Kevin Grittner 2010-08-02 19:29:29 Re: Optimizing NOT IN plans / verify rewrite
Previous Message Tom Lane 2010-08-02 17:29:47 Re: what does "initplan" operation in explain output mean?