From: | "Andrew J(dot) Kelly" <akelly(at)targitinteractive(dot)com> |
---|---|
To: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: Slow self-join on a 100 million record table |
Date: | 2002-12-31 16:04:35 |
Message-ID: | e9kyJZOsCHA.1808@TK2MSFTNGP09 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
Mark,
I doubt very much you will ever get much faster results (without increasing
hardware) in a situation such as that. Your queries don't look selective
enough to effectively use the indexes. What is the query plan for each of
the individual selects and what does it look like as a whole? How many
rows does each individual select return and how many for the final
statement?
--
Andrew J. Kelly
SQL Server MVP
"Mark Davies" <mdavies(at)ilstu(dot)edu> wrote in message
news:a28f4bc4(dot)0212310746(dot)4a272b00(at)posting(dot)google(dot)com(dot)(dot)(dot)
> I have a database containing 100 million records, in which each record
> contains (in sequence) all of the words in a 100 million word
> collection of texts. There are two columns: TheID (offset value) and
> TheWord (sequential words), e.g.:
>
> TheID TheWord
> ---- -----
>
> 1 I
> 2 saw
> 3 the
> 4 man
> 5 that
> 6 came
> . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
>
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow. I have a
> SQL Server 7.0 database with a clustered index on TheWord (sequential
> words) and a normal index on TheID. Even with all of this, however, a
> self-join query like the one just listed takes about 15 seconds on my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions? Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.
>
> Mark Davies
> Illinois State University
>
> P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
> adequate for my purposes -- there's a lot more to the project than
> what I've described here.
From | Date | Subject | |
---|---|---|---|
Next Message | Jie Liang | 2002-12-31 19:09:33 | Re: plpgsql select into question |
Previous Message | J D | 2002-12-31 15:49:57 | plpgsql select into question |