Re: "large" IN/NOT IN subqueries result in query returning

From: John McCawley <nospam(at)hardgeus(dot)com>
To: George Pavlov <gpavlov(at)mynewplace(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning
Date: 2005-12-27 19:53:30
Message-ID: 43B19BBA.8050403@hardgeus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

At a glance I would guess that NULL values in one or both of your tables
is the culprit. NULL values always return false.

Example:

A quick test on my database:

select count(*) FROM tbl_employee;

count
-------
2689

select count(*) FROM tbl_employee WHERE username IS NULL;
count
-------
35

So I have 35 null values.

create table tbl_foo (username varchar(32));

insert into tbl_foo (username) values ('poop');

select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN
(select tbl_foo.username FROM tbl_foo);
count
-------
2654

So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL
values are not included when I use the "NOT IN" my query. Nulls can be
confusing.

Hope this helps.

George Pavlov wrote:

>The following looks like a bug to me, but please someone review and tell
>me what I might be missing. Seems that past a certain result set size a
>"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
>even when there should be matches. No errors are returned, just faulty
>data. The exact threshholds seem to depend on the data, subquery and
>possibly the indexes in place. Nothing in the documentation mentions a
>limit to what "IN subquery" can take so I am quite dismayed (if there is
>a limit I would expect an ERROR/WARNING/something, but not wrong data).
>Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
>WinXP, and 8.0.4 and 8.0.5 on Linux.
>
>I have two tables:
> t1 (id, name, a_type)
> t2 (id, name)
>
>The t1 table is "big" (483211 rows), the name column contains
>duplicates, the a_type has only two values. The t2 table is "small" (40
>rows), the name values are unique. Some, but not all, t2.name values are
>present in t1.name. To be precise, there are 10 t2.name values that do
>not occur in t1.name (this is based on extraneous knowledge). I want to
>find out which of the t2.name values are not ever used in t1.name:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1);
>--> 0
>
>This should return 10, instead it returns 0!!! Adding a LIMIT to the
>subquery and doing some trial and error produces very interesting
>results:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261683)
>--> 13
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261684)
>--> 0
>
>What is so magical about 261683? The JOIN alternative produces the
>correct results no matter what:
>
>select count(*)
> from t2 left join t1 using (name)
> where t1.name is null
>--> 10
>
>This pretty much summarizes the issue. Any thoughts greatly appreciated.
>Follow a few variations to show how the threshhold varies.
>
>-- restrict to only one a_type
>
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message George Pavlov 2005-12-27 20:36:21 Re: "large" IN/NOT IN subqueries result in query returning wrong data
Previous Message Tom Lane 2005-12-27 19:44:51 Re: "large" IN/NOT IN subqueries result in query returning wrong data