Slow self-join on a 100 million record table

From: mdavies(at)ilstu(dot)edu (Mark Davies)
To: pgsql-sql(at)postgresql(dot)org
Subject: Slow self-join on a 100 million record table
Date: 2002-12-31 15:46:13
Message-ID: a28f4bc4.0212310746.4a272b00@posting.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts. There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:

TheID TheWord
---- -----

1 I
2 saw
3 the
4 man
5 that
6 came
. . .
100000000 xxx

To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following
words, e.g.:

select count(*),w1.w1,w2.w1,w3.w1 from
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
inner join
(select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
on w2.ID = w1.ID)
inner join
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID=w1.ID
group by w1.w1,w2.w1,w3.w1

This would yield results like "the man that" (words 3-5 above),"that
woman who","this man which", etc.

The problem is, the self-join solution is extremely slow. I have a
SQL Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID. Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).

Any suggestions? Have I messed up in terms of the SQL statement?
Thanks in advance for any help that you can give.

Mark Davies
Illinois State University

P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message J D 2002-12-31 15:49:57 plpgsql select into question
Previous Message floyds 2002-12-30 17:40:44 empty arrays