Re: Curious why planner can't handle NOT IN

From: pradeep singh <pradeepsingh1988(at)gmail(dot)com>
To: Yang Zhang <yanghatespam(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Curious why planner can't handle NOT IN
Date: 2013-05-03 15:30:12
Message-ID: CA+TUHZU=+XGhnF82OC=Dkuj4afCh8kvGpmhiNtA+vx38bL4kYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On May 3, 2013 12:15 PM, "Yang Zhang" <yanghatespam(at)gmail(dot)com> wrote:

> I have an `account` table with 5.3M rows, with primary key `id` of
> type `text` (and 600+ columns if that matters).
>
> I'm trying to create a `newaccount` table with the same schema but
> 600k newly imported rows, then insert all the old rows for which `id`
> isn't already in the set of newly updated rows.
>
> => create table newaccount as select * from account limit 0;
>
> => \copy newaccount from stdin with (format csv)
> [...copy 600k rows in...]
>
> => analyze newaccount;
>
> The most obvious query doesn't work so hot due to repeated execution
> of the subplan:
>
> => explain insert into newaccount select * from account where id not
> in (select id from account);
>
> QUERY PLAN
>
> -----------------------------------------------------------------------------------------
> Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366)
> -> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976
> width=5366)
> Filter: (NOT (SubPlan 1))
> SubPlan 1
> -> Materialize (cost=0.00..153276.00 rows=150000 width=32)
> -> Seq Scan on newaccount (cost=0.00..151500.00
> rows=150000 width=32)
> (6 rows)
>
> This works fine, though:
>
> => explain insert into newaccount select * from account where id in
> (select id from account except select id from newaccount);
>
> QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------
> Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976
> width=5366)
> -> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366)
> Hash Cond: (account.id = "ANY_subquery".id)
> -> Seq Scan on account (cost=0.00..2263744.51 rows=5971951
> width=5366)
> -> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32)
> -> Subquery Scan on "ANY_subquery"
> (cost=3543223.59..3633552.85 rows=5971951 width=32)
> -> SetOp Except (cost=3543223.59..3573833.34
> rows=5971951 width=19)
> -> Sort (cost=3543223.59..3558528.47
> rows=6121951 width=19)
> Sort Key: "*SELECT* 1".id
> -> Append (cost=0.00..2476464.02
> rows=6121951 width=19)
> -> Subquery Scan on "*SELECT*
> 1" (cost=0.00..2323464.02 rows=5971951 width=19)
> -> Seq Scan on account
> (cost=0.00..2263744.51 rows=5971951 width=19)
> -> Subquery Scan on "*SELECT*
> 2" (cost=0.00..153000.00 rows=150000 width=32)
> -> Seq Scan on
> newaccount (cost=0.00..151500.00 rows=150000 width=32)
> (14 rows)
>
> This is all in PG 9.1.
>
> This isn't a big deal as there's a straightforward workaround, but I
> am curious what happened here. Googling turns up various mentions of
> "NOT IN" with poor plans that involve subplans. Then again I have
> read mention of hash anti-join which seems appropriate here(?), but
> that wasn't used here (nor was the latter join used, though for
> whatever reason it looks like a more complex/deeper plan tree than I
> had expected, so maybe it was out of the plan generator's reach?).
>
> E.g. the following mentions cranking up work_mem, but I probably can't
> crank up work_mem to meet the requirements of this example, and even
> if it doesn't fit in memory, it'd be nice for the planner to not
> degenerate to a pathological plan and still execute this join
> efficiently while spilling to and from disk.
>
>
> http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2
>
> Thanks!
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Doug Gorley 2013-05-03 17:41:53 How to monitor recovery on Windows?
Previous Message Tom Lane 2013-05-03 14:42:53 Re: SPI_execute_with_args call