From: | Howard Rogers <hjr(at)diznix(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Full Text Search dictionary issues |
Date: | 2010-07-15 23:20:38 |
Message-ID: | AANLkTil1tzK713pCJHz3i0Dy-a345zsOkIjYj-TxJinW@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
I have 10 million rows in a table, with full text index created on one
of the columns. I submit this query:
ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
count
-------
646
(1 row)
Time: 107.570 ms
...and those are excellent times. But if I alter the query to read:
ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftx1');
count
-------
38343
(1 row)
Time: 640.985 ms
...then, as you see, it slows the query down by a factor of about 6,
which is not so good! The problem is that we need to be able to search
for "ftx1", since that's a flag we put in our document records to tell
us the file type, and we need to be able to retrieve different file
types at different times.
Now, I *think* the problem is that 'ftx1' is not being treated as
though it were a proper word:
ims=# select * from ts_debug('english','woman ball ftx1');
alias | description | token | dictionaries |
dictionary | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | woman | {english_stem} |
english_stem | {woman}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ball | {english_stem} |
english_stem | {ball}
blank | Space symbols | | {} | |
numword | Word, letters and digits | ftx1 | {simple} |
simple | {ftx1}
(5 rows)
Instead of being an asciiword that uses the english-stem dictionary,
it 'ftx1' gets regarded as a numword in the simple dictionary.
If I simply replace "ftxa" for "ftx1", it *is* then regarded as an
asciiword, and performance of the original query reverts to being just
fine, too:
ims=# select * from ts_debug('english','woman ball ftxa');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ball | {english_stem} | english_stem | {ball}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | ftxa | {english_stem} | english_stem | {ftxa}
ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftxa');
count
-------
0
(1 row)
Time: 88.603 ms
As you can see, 88ms for a search with 'ftxa' compared to 600+ms for
one with 'ftx1'.
I should mention that we have about 45 different 'nonsense word' flags
we use for all sorts of different purposes, such as telling us which
region a document is visible in, whether it is in portrait or
landscape mode and so on. All of these flag-words take the form of
zzzz1, or yyyy2 and so on. So there's a lot of these things causing
the problem, not just 'ftx1' specifically.
My question is, then, what I can do to stop this slowdown? Is there
some way to add 'ftx1' (and the others) as a word in the english_stem
dictionary so that it gets regarded as an asciiword, for example? Or
is there something else I can do to address the problem?
I'm fairly new to PostgreSQL's full text search. I've read Chapter 12
of the doco with rapt attention, but I don't see anything that leaps
at me as a fix for this issue. All help greatefully received,
therefore, and apologies in advance if this is a bit of a newbie
question.
Regards
HJR
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2010-07-16 00:23:15 | Re: Full Text Search dictionary issues |
Previous Message | Tom Lane | 2010-07-15 16:58:33 | Re: Idle In Transaction |