From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
Cc: | Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Allowing NOT IN to use ANTI joins |
Date: | 2014-07-11 00:17:59 |
Message-ID: | 13766.1405037879@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> Attached is the updated version of the patch.
I spent some time fooling with this patch, cleaning up the join-alias
issue as well as more-cosmetic things. However, while testing it
I realized that the whole thing is based on a false premise: to equate
a NOT IN with an antijoin, we'd have to prove not only that the inner
side is non-nullable, but that the outer comparison values are too.
Here's an example:
regression=# create table zt1 (f1 int);
CREATE TABLE
regression=# insert into zt1 values(1);
INSERT 0 1
regression=# insert into zt1 values(2);
INSERT 0 1
regression=# insert into zt1 values(null);
INSERT 0 1
regression=# create table zt2 (f1 int not null);
CREATE TABLE
regression=# insert into zt2 values(1);
INSERT 0 1
With the patch in place, we get
regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
-------------------------------------------------------------------
Hash Anti Join (cost=64.00..144.80 rows=1200 width=4)
Hash Cond: (zt1.f1 = zt2.f1)
-> Seq Scan on zt1 (cost=0.00..34.00 rows=2400 width=4)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.646 ms
(6 rows)
regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2
(2 rows)
which is the wrong answer, as demonstrated by comparison to the result
without optimization:
regression=# alter table zt2 alter column f1 drop not null;
ALTER TABLE
regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
---------------------------------------------------------------
Seq Scan on zt1 (cost=40.00..80.00 rows=1200 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.163 ms
(5 rows)
regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2
(1 row)
We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.
For the archives' sake, I attach an updated version with the fixes
I'd applied so far.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
not_in_anti_join_v0.7.patch | text/x-diff | 34.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2014-07-11 00:49:55 | Re: WAL replay bugs |
Previous Message | Michael Paquier | 2014-07-10 23:12:15 | Re: IMPORT FOREIGN SCHEMA statement |