Re: Tsearch2 custom dictionaries

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: psql-mail(at)freeuk(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tsearch2 custom dictionaries
Date: 2003-08-07 14:31:07
Message-ID: Pine.GSO.4.56.0308071816320.17880@ra.sai.msu.su
Lists: pgsql-general

On Thu, 7 Aug 2003 psql-mail(at)freeuk(dot)com wrote:

> Part1.
>
> I have created a dictionary called 'webwords' which checks all words
> and curtails them to 300 chars (for now)
>
> after running
> make
> make install
>
> I then copied the lib_webwords.so into my $libdir
>
> I have run
>
> psql mybd < dict_webwords.sql
>
> The tutorial shows how to install the intdict for integer types. How
> should I install my custom dictionary?

Once you have run 'psql mybd < dict_webwords.sql' you should be able to use it :)
Test it:
select lexize('webwords','some_web_word');
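
If lexize returns what you expect, the remaining step is presumably the same
as for intdict in the tutorial: point the token types you care about at the
new dictionary in pg_ts_cfgmap. A minimal sketch, assuming the 'default'
configuration and the word token aliases lword, nlword and word; adjust both
to whatever your setup actually uses:

```sql
-- Assumption: 'default' is the configuration in use; check pg_ts_cfg for yours.
-- Assumption: lword, nlword and word are the token types to route through webwords.
update pg_ts_cfgmap set dict_name = '{webwords}'
 where ts_name = 'default'
   and tok_alias in ('lword', 'nlword', 'word');

-- Check the resulting mapping for that configuration.
select tok_alias, dict_name from pg_ts_cfgmap where ts_name = 'default';
```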

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict ?

>
>
> Part2.
>
> The dictionary I am trying to create is to be used for searching
> multilingual text. My aim is to have fast search over all text, but
> ignore binary encoded data which is also present. (I will probably move
> to ignoring long words in the text eventually).
> What is the best approach to tackle this problem?
> As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need a UTF-8 database, but tsearch2 isn't
UTF-8 compatible :(

> I also need to include many non-standard words in the index such as
> URLs and message IDs contained in the text.
>

What's a message ID? An integer? That's already recognized by the parser.

Try:
select * from token_type();

Also, the latest version of tsearch2 (for 7.3, grab it from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 it is available from CVS)
has a rather useful function, ts_debug:

apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
 ts_name | tok_type | description |     token      | dict_name |     tsvector
---------+----------+-------------+----------------+-----------+------------------
 simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
 simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
(2 rows)
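
The same call is a quick way to see how a message ID gets split up before
deciding whether a custom parser is needed. A sketch only, with a made-up ID;
no output is shown because it depends on the tsearch2 version and the
current configuration:

```sql
-- Hypothetical message ID, for illustration only.
select tok_type, token, dict_name
  from ts_debug('Message-ID: <12345.67890@mail.example.com>');
```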

> I get the feeling that building these indexes will by no means be an
> easy task so any suggestions will be gratefully received!
>

As a last resort, you may write your own parser. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

> Thanks...
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
