From: | mdavies(at)ilstu(dot)edu (Mark Davies) |
---|---|
To: | pgsql-sql(at)postgresql(dot)org |
Subject: | Slow self-join on a 100 million record table |
Date: | 2002-12-31 15:46:13 |
Message-ID: | a28f4bc4.0212310746.4a272b00@posting.google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts. There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:
TheID TheWord
---- -----
1 I
2 saw
3 the
4 man
5 that
6 came
. . .
100000000 xxx
To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following
words, e.g.:
select count(*),w1.w1,w2.w1,w3.w1 from
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
inner join
(select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
on w2.ID = w1.ID)
inner join
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID=w1.ID
group by w1.w1,w2.w1,w3.w1
This would yield results like "the man that" (words 3-5 above),"that
woman who","this man which", etc.
The problem is, the self-join solution is extremely slow. I have a
SQL Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID. Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
Any suggestions? Have I messed up in terms of the SQL statement?
Thanks in advance for any help that you can give.
Mark Davies
Illinois State University
P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.
From | Date | Subject | |
---|---|---|---|
Next Message | J D | 2002-12-31 15:49:57 | plpgsql select into question |
Previous Message | floyds | 2002-12-30 17:40:44 | empty arrays |