From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Dirschel, Steve" <steve(dot)dirschel(at)thomsonreuters(dot)com>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Postgres NOT IN vs NOT EXISTS optimization |
Date: | 2022-08-11 20:12:33 |
Message-ID: | YvVisTU3tb8Vcg2v@momjian.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
On Tue, Jun 14, 2022 at 12:09:16PM -0400, Tom Lane wrote:
> "Dirschel, Steve" <steve(dot)dirschel(at)thomsonreuters(dot)com> writes:
> > Is Postgres able to drive the query the same way with the NOT IN as the
> > NOT EXISTS is doing or is that only available if the query has a NOT
> > EXISTS?
>
> NOT IN is not optimized very well in PG, because of the strange
> semantics that the SQL spec demands when the sub-query produces any
> null values. There's been some interest in detecting cases where
> we can prove that the subquery produces no nulls and then optimizing
> it into NOT EXISTS, but it seems like a lot of work for not-great
> return, so nothing's happened (yet). Perhaps Oracle does something
> like that already, or perhaps they're just ignoring the semantics
> problem; they do not have a reputation for hewing closely to the
> spec on behavior regarding nulls.
I was just now researching NOT IN behavior and remembered this thread,
so wanted to give a simplified example. If you set up tables like this:
CREATE TABLE small AS
SELECT * FROM generate_series(1, 10) AS t(x);
CREATE TABLE large AS SELECT small.x
FROM small CROSS JOIN generate_series(1, 1000) AS t(x);
INSERT INTO small VALUES (11), (12);
ANALYZE small, large;
These IN and EXISTS/NOT EXISTS queries look fine. using hash joins:
EXPLAIN SELECT small.x
FROM small
WHERE small.x IN (SELECT large.x FROM large);
QUERY PLAN
-----------------------------------------------------------------------------
Hash Join (cost=170.22..171.49 rows=10 width=4)
Hash Cond: (small.x = large.x)
-> Seq Scan on small (cost=0.00..1.12 rows=12 width=4)
-> Hash (cost=170.10..170.10 rows=10 width=4)
-> HashAggregate (cost=170.00..170.10 rows=10 width=4)
Group Key: large.x
-> Seq Scan on large (cost=0.00..145.00 rows=10000 width=4)
EXPLAIN SELECT small.x
FROM small
WHERE EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
QUERY PLAN
-----------------------------------------------------------------------------
Hash Join (cost=170.22..171.49 rows=10 width=4)
Hash Cond: (small.x = large.x)
-> Seq Scan on small (cost=0.00..1.12 rows=12 width=4)
-> Hash (cost=170.10..170.10 rows=10 width=4)
-> HashAggregate (cost=170.00..170.10 rows=10 width=4)
Group Key: large.x
-> Seq Scan on large (cost=0.00..145.00 rows=10000 width=4)
EXPLAIN SELECT small.x
FROM small
WHERE NOT EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
QUERY PLAN
-----------------------------------------------------------------------
Hash Anti Join (cost=270.00..271.20 rows=2 width=4)
Hash Cond: (small.x = large.x)
-> Seq Scan on small (cost=0.00..1.12 rows=12 width=4)
-> Hash (cost=145.00..145.00 rows=10000 width=4)
-> Seq Scan on large (cost=0.00..145.00 rows=10000 width=4)
These NOT IN queries all use sequential scans, and IS NOT NULL does not help:
EXPLAIN SELECT small.x
FROM small
WHERE small.x NOT IN (SELECT large.x FROM large);
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on small (cost=170.00..171.15 rows=6 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on large (cost=0.00..145.00 rows=10000 width=4)
EXPLAIN SELECT small.x
FROM small
WHERE small.x NOT IN (SELECT large.x FROM large WHERE large.x IS NOT NULL);
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on small (cost=170.00..171.15 rows=6 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on large (cost=0.00..145.00 rows=10000 width=4)
Filter: (x IS NOT NULL)
Is converting NOT IN to NOT EXISTS our only option? Couldn't we start
to create the hash and just switch to always returning NULL if we see
any NULLs while we are creating the hash?
--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com
Indecision is a decision. Inaction is an action. Mark Batterson
From | Date | Subject | |
---|---|---|---|
Next Message | Theofilos Theofovos | 2022-08-11 20:19:09 | Re: Strategy for preparing a query containg dynamic case / when |
Previous Message | David G. Johnston | 2022-08-11 17:07:28 | Re: Surprising results from current_role in a "security invoker" trigger function in a "cascade delete via FK" scenario |
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2022-08-11 20:50:48 | Re: Postgres NOT IN vs NOT EXISTS optimization |
Previous Message | Andres Freund | 2022-08-11 20:03:43 | test failure with gcc-12 -O3 -march=native |