Quick Links

Full Text Search dictionary issues

From:	Howard Rogers <hjr(at)diznix(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Full Text Search dictionary issues
Date:	2010-07-15 23:20:38
Message-ID:	AANLkTil1tzK713pCJHz3i0Dy-a345zsOkIjYj-TxJinW@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

I have 10 million rows in a table, with full text index created on one
of the columns. I submit this query:

ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
count
-------
646
(1 row)
Time: 107.570 ms

...and those are excellent times. But if I alter the query to read:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftx1');
count
-------
38343
(1 row)
Time: 640.985 ms

...then, as you see, it slows the query down by a factor of about 6,
which is not so good! The problem is that we need to be able to search
for "ftx1", since that's a flag we put in our document records to tell
us the file type, and we need to be able to retrieve different file
types at different times.

Now, I *think* the problem is that 'ftx1' is not being treated as
though it were a proper word:

ims=# select * from ts_debug('english','woman ball ftx1');
alias | description | token | dictionaries |
dictionary | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | woman | {english_stem} |
english_stem | {woman}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ball | {english_stem} |
english_stem | {ball}
blank | Space symbols | | {} | |
numword | Word, letters and digits | ftx1 | {simple} |
simple | {ftx1}
(5 rows)

Instead of being an asciiword that uses the english-stem dictionary,
it 'ftx1' gets regarded as a numword in the simple dictionary.

If I simply replace "ftxa" for "ftx1", it *is* then regarded as an
asciiword, and performance of the original query reverts to being just
fine, too:

ims=# select * from ts_debug('english','woman ball ftxa');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ball | {english_stem} | english_stem | {ball}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ftxa | {english_stem} | english_stem | {ftxa}

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftxa');
count
-------
0
(1 row)

Time: 88.603 ms

As you can see, 88ms for a search with 'ftxa' compared to 600+ms for
one with 'ftx1'.

I should mention that we have about 45 different 'nonsense word' flags
we use for all sorts of different purposes, such as telling us which
region a document is visible in, whether it is in portrait or
landscape mode and so on. All of these flag-words take the form of
zzzz1, or yyyy2 and so on. So there's a lot of these things causing
the problem, not just 'ftx1' specifically.

My question is, then, what I can do to stop this slowdown? Is there
some way to add 'ftx1' (and the others) as a word in the english_stem
dictionary so that it gets regarded as an asciiword, for example? Or
is there something else I can do to address the problem?

I'm fairly new to PostgreSQL's full text search. I've read Chapter 12
of the doco with rapt attention, but I don't see anything that leaps
at me as a fix for this issue. All help greatefully received,
therefore, and apologies in advance if this is a bit of a newbie
question.

Regards
HJR

Responses

Re: Full Text Search dictionary issues at 2010-07-16 00:23:15 from Tom Lane

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tom Lane	2010-07-16 00:23:15	Re: Full Text Search dictionary issues
Previous Message	Tom Lane	2010-07-15 16:58:33	Re: Idle In Transaction