Re: Slow self-join on a 100 million record table

From: "Paul Thornett" <pault(at)terrigal(dot)net(dot)au>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Slow self-join on a 100 million record table
Date: 2003-01-02 03:30:00
Message-ID: ONtn28gsCHA.440@TK2MSFTNGP12
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I've played with this quite a lot on a fairly similar machine to
yours - i.e. Dual pentium III 700s, 5x4.5Gb 10k Scsi disks Hardware
Raid0, 1 Gb. Ram, Windows 2000 Professional (SP3), Sql Server 2000
Desktop (SP2). As expected my elapse times are almost exactly twice as
long as yours.

I created my 100 million rows using an 8k file of words which I then
copied and appended until it contained 100 million rows. Then I DTssed
it into my table:
CREATE TABLE test
(wid int IDENTITY (1, 1) NOT NULL,
word varchar(20) NOT NULL) ON [PRIMARY]

I went through numerous sequences, finally ending up with the data and
clustered index in the Primary Filegroup on my Raid0 disks, and the
nonclustered primary key (wid) in a secondary Filegroup on a separate
Firewire disk.

I also simplified your Sql (I don't see any need for subqueries), as
follows:

SELECT Refs=COUNT(*), t1.word, t2.word, t3.word
FROM test t1
JOIN test t2
ON t2.wid=t1.wid+1
AND t1.word in ('any','of','a')
AND t2.word in ('corel','woman','person')
JOIN test t3
ON t3.wid=t1.wid+2
AND t3.word in ('ventura','that','which')
GROUP BY t1.word, t2.word, t3.word

This gives me:

Refs word word word
---------------------------------
23086 a Corel VENTURA
57715 of Corel VENTURA

All the permutations I tried gave an identical result - it always
takes 30 seconds to run the above query on my machine, or your version
of the same query, or any number of other ideas I tried. Just to
select the first word with no joins takes 5 seconds. I was surprised
the secondary Filegroup didn't improve speed at all.

I feel like I've been on a long journey, and ended up exactly where I
started!!

--
Paul Thornett

"Mark Davies" <mdavies(at)ilstu(dot)edu> wrote in message
news:a28f4bc4(dot)0212310746(dot)4a272b00(at)posting(dot)google(dot)com(dot)(dot)(dot)
> I have a database containing 100 million records, in which each
record
> contains (in sequence) all of the words in a 100 million word
> collection of texts. There are two columns: TheID (offset value)
and
> TheWord (sequential words), e.g.:
>
> TheID TheWord
> ---- -----
>
> 1 I
> 2 saw
> 3 the
> 4 man
> 5 that
> 6 came
> . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in
which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this'))
w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person'))
w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which'))
w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow. I have a
> SQL Server 7.0 database with a clustered index on TheWord
(sequential
> words) and a normal index on TheID. Even with all of this, however,
a
> self-join query like the one just listed takes about 15 seconds on
my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions? Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Minghann Ho 2003-01-02 04:36:23 bug?
Previous Message Daniel Bruce Lynes 2003-01-01 21:54:40 Internal functions documented