Re: How to speed up delete where not in

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Andrus <kobruleht2(at)hot(dot)ee>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to speed up delete where not in
Date: 2015-09-26 11:33:18
Message-ID: CAKJS1f_C5dWqpJCefb4T3RTftRGdxtrCKVJaob18wyh+H=KbJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 26 September 2015 at 19:53, Andrus <kobruleht2(at)hot(dot)ee> wrote:

> Hi!
>
>
> I'm looking for a way to delete records which do not have child rows on
> big tables where lot of rows needs to be deleted. Both tables have lot of
> other foreign key references.
>
>
> Document headers are in omdok table:
>
> create table omdok ( dokumnr serial primary key, ... );
>
> Document rows are in omrid table
>
> CREATE TABLE omrid
> (
> id serial NOT NULL,
> reanr serial NOT NULL,
> dokumnr integer NOT NULL,
> CONSTRAINT omrid_pkey PRIMARY KEY (id),
> CONSTRAINT omrid_dokumnr_fkey FOREIGN KEY (dokumnr)
> REFERENCES omdok (dokumnr) MATCH SIMPLE
> ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY
> IMMEDIATE,
> ....
> );
>
>
> I tried
>
> delete from omdok where dokumnr not in (select dokumnr from omrid)
>
> Query it is running currently 15 hours and is still running.
> postgres.exe is using 50% CPU all the time (this is 2 core CPU).
>
> explain delete from omdok where dokumnr not in (select dokumnr
> from omrid)
>
> returns:
>
> "Delete (cost=0.00..21971079433.34 rows=220815 width=6)"
> " -> Seq Scan on omdok (cost=0.00..21971079433.34 rows=220815
> width=6)"
> " Filter: (NOT (SubPlan 1))"
> " SubPlan 1"
> " -> Materialize (cost=0.00..94756.92 rows=1897261 width=4)"
> " -> Seq Scan on omrid (cost=0.00..77858.61
> rows=1897261 width=4)"
>
> - How to delete parents without child rows fast?
>

if you write the query as with a NOT EXISTS, rather than a NOT IN() it
should complete much more quickly.

It's important to know that the semantics of NOT IN are likely not at all
what you think:

For example, in the following query, would you expect it to return the row
with a.a = 3 ?

select * from (values(1),(2),(3)) a(a) WHERE a NOT IN(SELECT b FROM
(VALUES(NULL),(1),(2)) b(b));

The presence of NULL causes this to not behave the way you might think, yet
it works exactly the way the SQL standard dictates.

You could think of this as equivalent to writing:

select * from (values(1),(2),(3)) a(a) WHERE a.a <> NULL AND a.a <> 1 AND
a.a <> 2;

And since a.a <> NULL is 'unknown', this causes the entire WHERE clause to
be false, therefore nothing matches.

For this reason PostgreSQL does not optimise NOT IN() the same way as it
optimises NOT EXISTS().

If you write the query as:

delete from omdok where not exists(select 1 from omrid where omdok.dokumnr
= omrid.dokumnr);

then you might see it execute in a few seconds. Perhaps you should consider
cancelling the current query, perhaps perform a VACUUM on omdoc after
cancelling, and then run the NOT EXISTS version instead.

I mocked up your tables locally, and populated them with the same number of
records as your row estimates in the EXPLAIN you pasted and I got:

test=# delete from omdok where not exists (select 1 from omrid where
omdok.dokumn = omrid.dokumnr);
DELETE 0
Time: 1698.233 ms

Whereas with the NOT IN() I cancelled it after 10 minutes.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Melvin Davidson 2015-09-26 13:44:27 Re: How to speed up delete where not in
Previous Message Andrus 2015-09-26 07:53:20 How to speed up delete where not in