hash-join forgets tuples

From: Sebastian Freundt <hroptatyr(at)gna(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: hash-join forgets tuples
Date: 2005-08-15 11:58:42
Message-ID: nhtk6inquml.fsf@mack.math.tu-berlin.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Hello,

using a highly surjective left (or inner) join to a table reveals data
loss if the hash join method is used.
Here, highly surjective means I have a table with about 1.4 million tuples
which map to a table with about 40000 tuples.

Now here's the explanation:

qaos=# explain select anfs.anf_id from anfs left join groups USING (group_id) where
anfs.degree=3 and not groups.abelian_p;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Hash Join (cost=8111.80..103961.04 rows=299036 width=521)
Hash Cond: ("outer".group_id = "inner".group_id)
-> Bitmap Heap Scan on anfs (cost=2048.58..71382.50 rows=325594 width=216)
Recheck Cond: (degree = 3)
-> Bitmap Index Scan on anfs_degree (cost=0.00..2048.58 rows=325594 width=0)
Index Cond: (degree = 3)
-> Hash (cost=4275.79..4275.79 rows=40172 width=313)
-> Bitmap Heap Scan on groups (cost=346.60..4275.79 rows=40172 width=313)
Filter: (NOT abelian_p)
-> Bitmap Index Scan on groups_not_properties (cost=0.00..346.60 rows=40119 width=0)
Index Cond: (abelian_p = false)
(11 rows)

The query result is:

qaos=# select anfs.anf_id from anfs left join groups USING (group_id)
where anfs.degree=3 and not groups.abelian_p;
anf_id
--------
(0 rows)

Now whenever other join-methods (like merge join) are used, there are
definitely tuples in the result:

qaos=# select anfs.anf_id from anfs left join groups USING (group_id)
where anfs.degree=3 and not groups.abelian_p limit 20;
anf_id
--------
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
(20 rows)

qaos=# explain select anfs.anf_id from anfs left join groups USING
(group_id) where anfs.degree=3 and not groups.abelian_p limit 20;
QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=0.00..138.09 rows=20 width=8)
-> Nested Loop (cost=0.00..2064738.73 rows=299036 width=8)
-> Index Scan using anfs_degree on anfs (cost=0.00..1078774.74 rows=325594 width=16)
Index Cond: (degree = 3)
-> Index Scan using groups_group_id on groups (cost=0.00..3.02 rows=1 width=8)
Index Cond: ("outer".group_id = groups.group_id)
Filter: (NOT abelian_p)
(7 rows)

The canonical step in this misery, SET enable_hashjoin TO off; makes
queries work again, but the performance really suffers from it.

Greetings,
Sebastian

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Richard Sydney-Smith 2005-08-15 12:58:52 Authenticating using Windows XP login
Previous Message Tino Wildenhain 2005-08-15 10:28:28 Re: atomic function