From: | "W(dot)H(dot) van Atteveldt" <wouter(at)2at(dot)nl> |
---|---|
To: | <pgsql-performance(at)postgresql(dot)org> |
Subject: | Postgres query optimization with varchar fields |
Date: | 2004-06-02 11:57:44 |
Message-ID: | C7F588D7403F1F4591EB34F1E2B5CFD10B1A3E@GAIA.2at.local |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Dear reader,
I am investigating whether it is useful to directly query a database
containing a rather large text corpus (order of magnitude 100k - 1m
newspaper articles, so around 100 million words), or whether I should
use third party text indexing services. I want to know things such as:
how often is a certain word (or pattern) mentioned in an article and how
often it is mentioned with the condition that another word is nearby
(same article or n words distant).
I created a table listing the words one word per row, and created an
index on the word and wordnr columns. An example query would be:
simple: select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>') group by articleid
complex: select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like '<PATTERN>') and exists (select
* from words w2 where w.articleid = w2.articleid and (word like
'<PATTERN2>')) group by articleid
According to the diagnostics, the database does use the indices for the
query, but it is still rather slow (around 10 minutes for a 'simple
query', x seconds for a complex one)
It is important that the complex query only counts instances where the
PATTERN is found and PATTERN2 only functions as a criterium and does not
add to the count.
My questions are: (technical details provided below)
- Does anyone disagree with the general setup?
- Is there a more sensible way to phrase my SQL?
- Any other ideas to improve performance?
Thanks,
Wouter van Atteveldt
Free University Amsterdam
------
Technicalities:
I am using a Postgresql 7.4.1 database on a linux machine (uname -a:
Linux swpc450.cs.vu.nl 2.4.22-1.2115.nptl #1 Wed Oct 29 15:31:21 EST
2003 i686 athlon i386 GNU/Linux). The table of interest is: (lemma, pos,
simplepos currently not used)
Table "public.words"
Column | Type | Modifiers
------------+------------------------+----------------------------------
---------------------
id | integer | not null default
nextval('public.words_id_seq'::text)
articleid | integer | not null
sentencenr | integer | not null
word | character varying(255) | not null
lemma | character varying(255) |
pos | character varying(255) |
simplepos | character(1) |
wordnr | integer | not null
parnr | integer | not null
Indexes:
"words_pkey" primary key, btree (id)
"words_aid" btree (articleid)
"words_word" btree (word)
"words_word_ptrn" btree (word varchar_pattern_ops)
"words_wordnr" btree (wordnr)
Query plans:
anoko=> explain select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like 'integratie%') group by
articleid;
QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
HashAggregate (cost=937959.21..937959.22 rows=2 width=4)
-> Hash IN Join (cost=95863.70..937816.01 rows=28640 width=4)
Hash Cond: ("outer".articleid = "inner".id)
-> Index Scan using words_word_ptrn on words w
(cost=0.00..836604.62 rows=208886 width=4)
Index Cond: (((word)::text ~>=~ 'integratie'::character
varying) AND ((word)::text ~<~ 'integratif'::character varying))
Filter: ((word)::text ~~ 'integratie%'::text)
-> Hash (cost=94998.60..94998.60 rows=146041 width=4)
-> Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles (cost=0.00..94998.60 rows=146041 width=4)
Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))
explain select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>') and exists (select * from words w2 where
w.articleid = w2.articleid and (word like '<PATTERN2>')) group by
articleid
anoko-> ;
QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
GroupAggregate (cost=168253089.23..168254556.46 rows=1 width=4)
-> Merge IN Join (cost=168253089.23..168254484.85 rows=14320
width=4)
Merge Cond: ("outer".articleid = "inner".id)
-> Sort (cost=168144438.23..168144699.33 rows=104443 width=4)
Sort Key: w.articleid
-> Index Scan using words_word_ptrn on words w
(cost=0.00..168134972.17 rows=104443 width=4)
Index Cond: ((word)::text ~=~
'<PATTERN>'::character varying)
Filter: (((word)::text ~~ '<PATTERN>'::text) AND
(subplan))
SubPlan
-> Index Scan using words_aid on words w2
(cost=0.00..836948.84 rows=1045 width=460)
Index Cond: ($0 = articleid)
Filter: ((word)::text ~~
'<PATTERN2>'::text)
-> Sort (cost=108651.01..109016.11 rows=146041 width=4)
Sort Key: articles.id
-> Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles (cost=0.00..94998.60 rows=146041 width=4)
Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Stark | 2004-06-02 14:27:36 | Re: PostgreSQL on VMWare vs Windows vs CoLinux |
Previous Message | Rod Taylor | 2004-06-02 03:31:50 | Re: PostgreSQL and Kernel 2.6.x |