JOINs ... how I hate them ...

From: "Marc G(dot) Fournier" <scrappy(at)hub(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: JOINs ... how I hate them ...
Date: 2002-02-07 14:23:22
Message-ID: 20020207100506.Y57607-100000@earth.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Okay, went back through teh archives, as I know that Tom provided a
solution for this before, and found it at:

http://archives.postgresql.org/pgsql-sql/2001-06/msg00329.php

Plain and simple ... makes perfect sense ... doesn't work in v7.2, or, at
least, not as I'm expecting it to ...

I've broken what I'm trying to do down the the *basest* component I can:

explain SELECT p.uid, p.handle
FROM gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ;

Which explains out as:

Hash Join (cost=1354.80..45297.83 rows=75664 width=27)
-> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19)
-> Hash (cost=1165.64..1165.64 rows=75664 width=8)
-> Seq Scan on gender_f pgf (cost=0.00..1165.64 rows=75664 width=8)

Now, profiles has uid as its primary KEY, and there are no
duplicates in gender_f ... so, as my HashJoin points out, I should have 75664
results returned ... that is expected ... and the SeqScan on gender_f is
expected ... but the SeqScan on profiles is what I would hope to get rid
of ... get uid from gender_f, find corresponding entry in profiles ... its only
ever goign to pull out 75664 out of 485969 records from profiles, so why
would it seqscan *through* profiles for each and every UID?

Now, if I go to the next level that I'm trying to pull together:

explain SELECT p.uid, p.handle
FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid = p.uid) ;

It still explains, what I think, is wrong:

Hash Join (cost=6023.92..47537.10 rows=75664 width=35)
-> Seq Scan on profiles p (cost=0.00..35707.69 rows=485969 width=19)
-> Hash (cost=5834.76..5834.76 rows=75664 width=16)
-> Merge Join (cost=0.00..5834.76 rows=75664 width=16)
-> Index Scan using poc_uid on orientation_c poc (cost=0.00..2807.82 rows=126477 width=8)
-> Index Scan using pgf_uid on gender_f pgf (cost=0.00..1575.79 rows=75664 width=8)

The MergeJoin between poc/pgf will only return 12000 records, and since it
is a 1:1 relationship between each of those tables, there will *only* be
12000 records pulled from profiles ... yet its doing a SeqScan through all
485k records for each of those UIDs?

This is after I've performed a VACUUM ANALYZE ...

The final query itself is:

SELECT p.uid, p.profiles_handle
FROM ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) ) JOIN iwantu_profiles p USING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid );

Which explains as:

Hash Join (cost=31636.40..78239.34 rows=75664 width=43)
-> Hash Join (cost=6023.92..47537.10 rows=75664 width=35)
-> Seq Scan on iwantu_profiles p (cost=0.00..35707.69 rows=485969 width=19)
-> Hash (cost=5834.76..5834.76 rows=75664 width=16)
-> Merge Join (cost=0.00..5834.76 rows=75664 width=16)
-> Index Scan using poc_uid on profiles_orientation_c poc (cost=0.00..2807.82 rows=126477 width=8)
-> Index Scan using pgf_uid on profiles_gender_f pgf (cost=0.00..1575.79 rows=75664 width=8)
-> Hash (cost=7955.64..7955.64 rows=485964 width=8)
-> Seq Scan on iwantu_last_login ll (cost=0.00..7955.64 rows=485964 width=8)

EXPLAIN

So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then
there is the SeqScan/HashJoin wiht profiles, which will leave me with
12000 records, but with more information ... but, again, for each of
*those* 12000 records, its doing a SeqScan on last_login's 485k records,
instead of using the index ... again, like pgf and poc, there is only one
record for every uid, so we aren't dealing with duplicates ...

Now, if I 'set enable_seqscan=false;' and do the exact same explain, it
definitely comes more in line with what I'd like to see, as far as index
usage is concerned:

Nested Loop (cost=0.00..546759.46 rows=75664 width=43)
-> Nested Loop (cost=0.00..272274.75 rows=75664 width=35)
-> Merge Join (cost=0.00..5834.76 rows=75664 width=16)
-> Index Scan using poc_uid on profiles_orientation_c poc (cost=0.00..2807.82 rows=126477 width=8)
-> Index Scan using pgf_uid on profiles_gender_f pgf (cost=0.00..1575.79 rows=75664 width=8)
-> Index Scan using iwantu_profiles_uid on iwantu_profiles p (cost=0.00..3.51 rows=1 width=19)
-> Index Scan using ill_uid on iwantu_last_login ll (cost=0.00..3.62 rows=1 width=8)

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Clift 2002-02-07 14:31:00 Re: Threaded PosgreSQL server
Previous Message Marc G. Fournier 2002-02-07 13:52:42 Re: Threaded PosgreSQL server