"large" IN/NOT IN subqueries result in query returning wrong data

From: "George Pavlov" <gpavlov(at)mynewplace(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-27 19:41:33
Message-ID: CCB89282FCE1024EA3DCE687A96A5164039EBD13@ehost010-6.exch010.intermedia.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

The following looks like a bug to me, but please someone review and tell
me what I might be missing. Seems that past a certain result set size a
"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
even when there should be matches. No errors are returned, just faulty
data. The exact threshholds seem to depend on the data, subquery and
possibly the indexes in place. Nothing in the documentation mentions a
limit to what "IN subquery" can take so I am quite dismayed (if there is
a limit I would expect an ERROR/WARNING/something, but not wrong data).
Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
WinXP, and 8.0.4 and 8.0.5 on Linux.

I have two tables:
t1 (id, name, a_type)
t2 (id, name)

The t1 table is "big" (483211 rows), the name column contains
duplicates, the a_type has only two values. The t2 table is "small" (40
rows), the name values are unique. Some, but not all, t2.name values are
present in t1.name. To be precise, there are 10 t2.name values that do
not occur in t1.name (this is based on extraneous knowledge). I want to
find out which of the t2.name values are not ever used in t1.name:

select count(*) from t2 where t2.name not in (
select t1.name from t1);
--> 0

This should return 10, instead it returns 0!!! Adding a LIMIT to the
subquery and doing some trial and error produces very interesting
results:

select count(*) from t2 where t2.name not in (
select t1.name from t1 limit 261683)
--> 13
select count(*) from t2 where t2.name not in (
select t1.name from t1 limit 261684)
--> 0

What is so magical about 261683? The JOIN alternative produces the
correct results no matter what:

select count(*)
from t2 left join t1 using (name)
where t1.name is null
--> 10

This pretty much summarizes the issue. Any thoughts greatly appreciated.
Follow a few variations to show how the threshhold varies.

-- restrict to only one a_type
--
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P' limit 289014)
--> 11
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P' limit 289015)
--> 0
select count(*)
from t2 left join (select * from t1 where a_type='P') t1 using (name)
where t1.name is null
--> 11

-- try distinct instead
--
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1)
--> 0
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 limit 56001)
--> 10
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 limit 56002)
--> 0

-- distinct + a_type restriction (adding the a_type restriction
-- here _lowers_ the threshhold)
--
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P' limit 33620)
--> 11
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P' limit 33621)
--> 0

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2005-12-27 19:44:51 Re: "large" IN/NOT IN subqueries result in query returning wrong data
Previous Message Ken Winter 2005-12-27 19:32:52 Re: The Information Schema vs the PG Catalog