Postgres query optimization with varchar fields

From: "W(dot)H(dot) van Atteveldt" <wouter(at)2at(dot)nl>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Postgres query optimization with varchar fields
Date: 2004-06-02 11:57:44
Message-ID: C7F588D7403F1F4591EB34F1E2B5CFD10B1A3E@GAIA.2at.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Dear reader,

I am investigating whether it is useful to directly query a database
containing a rather large text corpus (order of magnitude 100k - 1m
newspaper articles, so around 100 million words), or whether I should
use third party text indexing services. I want to know things such as:
how often is a certain word (or pattern) mentioned in an article and how
often it is mentioned with the condition that another word is nearby
(same article or n words distant).

I created a table listing the words one word per row, and created an
index on the word and wordnr columns. An example query would be:

simple: select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>') group by articleid
complex: select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like '<PATTERN>') and exists (select
* from words w2 where w.articleid = w2.articleid and (word like
'<PATTERN2>')) group by articleid

According to the diagnostics, the database does use the indices for the
query, but it is still rather slow (around 10 minutes for a 'simple
query', x seconds for a complex one)

It is important that the complex query only counts instances where the
PATTERN is found and PATTERN2 only functions as a criterium and does not
add to the count.

My questions are: (technical details provided below)
- Does anyone disagree with the general setup?
- Is there a more sensible way to phrase my SQL?
- Any other ideas to improve performance?

Thanks,

Wouter van Atteveldt
Free University Amsterdam

------

Technicalities:

I am using a Postgresql 7.4.1 database on a linux machine (uname -a:
Linux swpc450.cs.vu.nl 2.4.22-1.2115.nptl #1 Wed Oct 29 15:31:21 EST
2003 i686 athlon i386 GNU/Linux). The table of interest is: (lemma, pos,
simplepos currently not used)

Table "public.words"
Column | Type | Modifiers
------------+------------------------+----------------------------------
---------------------
id | integer | not null default
nextval('public.words_id_seq'::text)
articleid | integer | not null
sentencenr | integer | not null
word | character varying(255) | not null
lemma | character varying(255) |
pos | character varying(255) |
simplepos | character(1) |
wordnr | integer | not null
parnr | integer | not null
Indexes:
"words_pkey" primary key, btree (id)
"words_aid" btree (articleid)
"words_word" btree (word)
"words_word_ptrn" btree (word varchar_pattern_ops)
"words_wordnr" btree (wordnr)

Query plans:

anoko=> explain select articleid, count(*) as count from words w where
articleid in (select id from articles where batchid in
(84,85,100,101,118,121)) and (word like 'integratie%') group by
articleid;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
HashAggregate (cost=937959.21..937959.22 rows=2 width=4)
-> Hash IN Join (cost=95863.70..937816.01 rows=28640 width=4)
Hash Cond: ("outer".articleid = "inner".id)
-> Index Scan using words_word_ptrn on words w
(cost=0.00..836604.62 rows=208886 width=4)
Index Cond: (((word)::text ~>=~ 'integratie'::character
varying) AND ((word)::text ~<~ 'integratif'::character varying))
Filter: ((word)::text ~~ 'integratie%'::text)
-> Hash (cost=94998.60..94998.60 rows=146041 width=4)
-> Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles (cost=0.00..94998.60 rows=146041 width=4)
Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))

explain select articleid, count(*) as count from words w where articleid
in (select id from articles where batchid in (84,85,100,101,118,121))
and (word like '<PATTERN>') and exists (select * from words w2 where
w.articleid = w2.articleid and (word like '<PATTERN2>')) group by
articleid
anoko-> ;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------
GroupAggregate (cost=168253089.23..168254556.46 rows=1 width=4)
-> Merge IN Join (cost=168253089.23..168254484.85 rows=14320
width=4)
Merge Cond: ("outer".articleid = "inner".id)
-> Sort (cost=168144438.23..168144699.33 rows=104443 width=4)
Sort Key: w.articleid
-> Index Scan using words_word_ptrn on words w
(cost=0.00..168134972.17 rows=104443 width=4)
Index Cond: ((word)::text ~=~
'<PATTERN>'::character varying)
Filter: (((word)::text ~~ '<PATTERN>'::text) AND
(subplan))
SubPlan
-> Index Scan using words_aid on words w2
(cost=0.00..836948.84 rows=1045 width=460)
Index Cond: ($0 = articleid)
Filter: ((word)::text ~~
'<PATTERN2>'::text)
-> Sort (cost=108651.01..109016.11 rows=146041 width=4)
Sort Key: articles.id
-> Index Scan using articles_batchid, articles_batchid,
articles_batchid, articles_batchid, articles_batchid, articles_batchid
on articles (cost=0.00..94998.60 rows=146041 width=4)
Index Cond: ((batchid = 84) OR (batchid = 85) OR
(batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid =
121))

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Stark 2004-06-02 14:27:36 Re: PostgreSQL on VMWare vs Windows vs CoLinux
Previous Message Rod Taylor 2004-06-02 03:31:50 Re: PostgreSQL and Kernel 2.6.x