From: | Maciek Sakrejda <msakrejda(at)truviso(dot)com> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Optimizing NOT IN plans / verify rewrite |
Date: | 2010-08-02 19:12:51 |
Message-ID: | AANLkTim0entqazUpWhwotC+gcDGrmeM=a=WPJUv=MgdO@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Hi,
I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
subqueries:
DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM bar
cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type != 'o');
The plan produced for this is:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using foo_type_index on foo (cost=17851.93..1271633830.75
rows=66410 width=6)
Index Cond: (type = 'o'::bpchar)
Filter: ((NOT (subplan)) AND (NOT (subplan)))
SubPlan
-> Materialize (cost=6077.87..10238.57 rows=299170 width=8)
-> Seq Scan on bar cqc (cost=0.00..4609.70 rows=299170 width=8)
-> Materialize (cost=11774.06..15728.45 rows=284339 width=8)
-> Seq Scan on foo car (cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
(9 rows)
Unfortunately, when these tables get large-ish, the materilzations get
really expensive to re-scan for every tuple (see cost above). At the
moment, I have ~500k rows in foo and ~300k rows in bar. The
selectivity of type = 'o' is ~50%. I've tried to re-write the query as
follows:
DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as b FROM
(SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM foo
cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND cqar2.type
!= 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER JOIN bar ON
(candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS NULL AND
bar.b IS NULL);
This gives the more sensible plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Hash IN Join (cost=48999.81..71174.41 rows=66410 width=6)
Hash Cond: (foo.b = cqar1.b)
-> Seq Scan on foo (cost=0.00..9003.78 rows=549978 width=14)
-> Hash (cost=47909.68..47909.68 rows=66410 width=8)
-> Hash Left Join (cost=24562.29..47909.68 rows=66410 width=8)
Hash Cond: (cqar1.b = bar.b)
Filter: (bar.b IS NULL)
-> Hash Left Join (cost=15043.96..33635.58 rows=132820 width=8)
Hash Cond: (cqar1.b = cqar2.b)
Filter: (cqar2.b IS NULL)
-> Seq Scan on foo cqar1 (cost=0.00..10378.73
rows=265639 width=8)
Filter: (type = 'o'::bpchar)
-> Hash (cost=10378.73..10378.73 rows=284339 width=8)
-> Seq Scan on foo cqar2
(cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
-> Hash (cost=4609.70..4609.70 rows=299170 width=8)
-> Seq Scan on bar (cost=0.00..4609.70
rows=299170 width=8)
(17 rows)
As far as I can tell, the results are identical.
My questions
1. Is my rewrite valid?
2. Any way to reliably achieve the second plan (or really, any plan
that doesn't rescan ~~500k tuples per each of ~250k tuples) by
tweaking (per-session) planner constants? I've played with this a
little, but without much success. As with any rewrite situation, I'd
prefer to stick with the simpler, more explicit original query.
Here is a SQL script to reproduce the problem:
\set ON_ERROR_STOP
drop schema if exists not_in_test cascade;
create schema not_in_test;
set search_path to not_in_test;
create table foo (
a oid not null,
b bigint not null,
type char not null,
ts timestamp without time zone not null
);
create index "foo_b_a_type_index" on foo (b, a, type);
create index "foo_a_index" on foo (a);
create index "foo_type_index" on foo(type);
create table bar (
b bigint unique not null,
c timestamp with time zone not null
);
create index "bar_b_index" on bar(b);
insert into foo select (random()*10)::integer,
generate_series(1,550000), case when random() > 0.5 then 'o' else 'x'
end, now();
insert into bar select val, now() from generate_series(1,1200000) as
vals(val) where random() > 0.75;
analyze foo;
analyze bar;
EXPLAIN DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b
FROM bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type
!= 'o');
EXPLAIN DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as
b FROM (SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM
foo cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND
cqar2.type != 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER
JOIN bar ON (candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS
NULL AND bar.b IS NULL);
Thanks,
---
Maciek Sakrejda | System Architect | Truviso
1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2010-08-02 19:29:29 | Re: Optimizing NOT IN plans / verify rewrite |
Previous Message | Tom Lane | 2010-08-02 17:29:47 | Re: what does "initplan" operation in explain output mean? |