From: | John McCawley <nospam(at)hardgeus(dot)com> |
---|---|
To: | George Pavlov <gpavlov(at)mynewplace(dot)com> |
Cc: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: "large" IN/NOT IN subqueries result in query returning |
Date: | 2005-12-27 19:53:30 |
Message-ID: | 43B19BBA.8050403@hardgeus.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
At a glance I would guess that NULL values in one or both of your tables
is the culprit. NULL values always return false.
Example:
A quick test on my database:
select count(*) FROM tbl_employee;
count
-------
2689
select count(*) FROM tbl_employee WHERE username IS NULL;
count
-------
35
So I have 35 null values.
create table tbl_foo (username varchar(32));
insert into tbl_foo (username) values ('poop');
select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN
(select tbl_foo.username FROM tbl_foo);
count
-------
2654
So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL
values are not included when I use the "NOT IN" my query. Nulls can be
confusing.
Hope this helps.
George Pavlov wrote:
>The following looks like a bug to me, but please someone review and tell
>me what I might be missing. Seems that past a certain result set size a
>"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
>even when there should be matches. No errors are returned, just faulty
>data. The exact threshholds seem to depend on the data, subquery and
>possibly the indexes in place. Nothing in the documentation mentions a
>limit to what "IN subquery" can take so I am quite dismayed (if there is
>a limit I would expect an ERROR/WARNING/something, but not wrong data).
>Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
>WinXP, and 8.0.4 and 8.0.5 on Linux.
>
>I have two tables:
> t1 (id, name, a_type)
> t2 (id, name)
>
>The t1 table is "big" (483211 rows), the name column contains
>duplicates, the a_type has only two values. The t2 table is "small" (40
>rows), the name values are unique. Some, but not all, t2.name values are
>present in t1.name. To be precise, there are 10 t2.name values that do
>not occur in t1.name (this is based on extraneous knowledge). I want to
>find out which of the t2.name values are not ever used in t1.name:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1);
>--> 0
>
>This should return 10, instead it returns 0!!! Adding a LIMIT to the
>subquery and doing some trial and error produces very interesting
>results:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261683)
>--> 13
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261684)
>--> 0
>
>What is so magical about 261683? The JOIN alternative produces the
>correct results no matter what:
>
>select count(*)
> from t2 left join t1 using (name)
> where t1.name is null
>--> 10
>
>This pretty much summarizes the issue. Any thoughts greatly appreciated.
>Follow a few variations to show how the threshhold varies.
>
>-- restrict to only one a_type
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | George Pavlov | 2005-12-27 20:36:21 | Re: "large" IN/NOT IN subqueries result in query returning wrong data |
Previous Message | Tom Lane | 2005-12-27 19:44:51 | Re: "large" IN/NOT IN subqueries result in query returning wrong data |