Re: Tsearch2 Dutch snowball stemmer in PG8.1

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Alban Hertroys <a(dot)hertroys(at)magproductions(dot)nl>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Tsearch2 Dutch snowball stemmer in PG8.1
Date: 2007-10-03 15:10:28
Message-ID: Pine.LNX.4.64.0710031859151.3304@sn.sai.msu.ru
Lists: pgsql-general

On Wed, 3 Oct 2007, Alban Hertroys wrote:

> Oleg Bartunov wrote:
>> Alban,
>>
>> the documentation you're referring to is for the upcoming 8.3 release.
>> For 8.1 and 8.2 you need to set up all the machinery by hand. It's not
>> difficult, for example:
>
> Thanks Oleg.
> I think I managed to do this right, although I had to google for some of
> the files (we don't have ispell installed).
>
> You also seem to have mixed Russian and English dictionaries in your
> example, I'm not sure that was on purpose?

yes, we index mixed content

>
> Anyway, I changed your example to use dutch dictionaries and locale
> where I thought it applicable, and I got something working apparently.
> Quite some guess work was involved, so I have a few questions left.
>
> The only odd thing is that to_tsvector('dutch', 'some dutch text') now
> returns '|' for stop words...

Could you pack up your dictionary files and .sql script, so we can look at
them in our spare time?
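
In the meantime, a few queries along these lines may help narrow down where
the '|' comes from. This is only a sketch: 'nl_ispell' and 'nl_stem' are
hypothetical placeholders for whatever names the Dutch dictionaries were
registered under, and 'nederlands' is the configuration name used in the
to_tsvector call quoted in this thread.

```sql
-- Assumption: 'nl_ispell' and 'nl_stem' are placeholder names; substitute
-- the dict_name values actually inserted into pg_ts_dict.

-- Ask each dictionary directly what it returns for a stop word.
select lexize('nl_ispell', 'de');
select lexize('nl_stem', 'de');

-- Show the init options (file paths) each dictionary was registered with.
select dict_name, dict_initoption
  from pg_ts_dict
 where dict_name in ('nl_ispell', 'nl_stem');

-- Show which dictionaries each token type is mapped to.
select tok_alias, dict_name
  from pg_ts_cfgmap
 where ts_name = 'nederlands';
```

If one of the dictionaries returns '|' from lexize, its stop word file is
probably the first thing to inspect. The Snowball stop.txt files carry '|'
comments after each word, so it may be worth checking whether the file needs
to be reduced to one bare word per line.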

>
> For example:
> select to_tsvector('nederlands', 'De beste stuurlui staan aan wal');
> to_tsvector
> ------------------------------------------------
> '|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3
>
>
> A minor nit... You ended the script with a hidden commit (END;). I would
> have preferred to experiment with the results a bit before committing...

That is up to you. It was just an example.
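
One way to experiment before committing, sketched here under the assumption
that the configuration is named 'nederlands' as in the query above, is to
leave the transaction open, test, and only then decide:

```sql
begin;

-- ... the insert/update statements from the example go here ...

-- Try the new configuration while the transaction is still open.
select to_tsvector('nederlands', 'De beste stuurlui staan aan wal');

-- Discard everything if the result looks wrong; use "commit;" (or "end;")
-- instead once the output is what you expect.
rollback;
```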

>
> I mixed in a few questions below, if you could answer them please?
>
>> -- sample tsearch2 configuration for search.postgresql.org
>> -- Creates configuration 'pg' - default, should match server's locale !!!
>> -- Change 'ru_RU.UTF-8'
>>
>> begin;
>>
>> -- create special (default) configuration 'pg'
>> update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8';
>
> I suppose this disables a possibly existing stemmer for that locale?

No, it just makes sure that 'pg' is the only (default) configuration for
locale 'ru_RU.UTF-8'. You can skip this.
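
To see whether the step is needed at all, it may help to compare the server's
locale settings with the locales already claimed in pg_ts_cfg. A minimal
sketch, using only the catalog columns that appear in the example:

```sql
-- Locale settings of the running server.
show lc_ctype;
show lc_collate;

-- Existing configurations and the locale each one is registered for.
select ts_name, prs_name, locale
  from pg_ts_cfg;
```

If no existing row carries the server's locale, there is nothing for the
update to clear.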

>
>> insert into pg_ts_cfg values('pg','default','ru_RU.UTF-8');
>>
>> -- register 'pg_dict' dictionary using synonym template
>> -- postgres pg
>> -- pgsql pg
>> -- postgresql pg
>> insert into pg_ts_dict
>> (select 'pg_dict',dict_init,
>> '/usr/local/pgsql-dev/share/contrib/pg_dict.txt',
>> dict_lexize, 'pg-specific dictionary'
>> from pg_ts_dict
>> where dict_name='synonym'
>> );
>>
>> -- register ispell dictionary, check paths and stop words
>> -- I used iconv on the English files, since there is some Cyrillic stuff
>> insert into pg_ts_dict
>> (SELECT 'en_ispell', dict_init,
>> 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",'
>> 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",'
>> 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"',
>> dict_lexize
>> FROM pg_ts_dict
>> WHERE dict_name = 'ispell_template'
>> );
>
> I actually use a .lat file here. I have no idea whether that's
> compatible (but it appears to have worked).

They are just filenames; the extension doesn't matter (for 8.1 and 8.2).
>
> I got my .lat and .aff files from:
> http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Dutch-dicts

You can use MySpell dictionaries.

>
> My stop words file is from:
> http://snowball.tartarus.org/algorithms/dutch/stop.txt
>
>> -- use the same stop-word list as 'en_ispell' dictionary
>> UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop'
>> where dict_name='en_stem';
>
> Why change the stop words for the English dictionary? I skipped this
> step. Is that right?

I wanted to have the same list of stop words for ispell and snowball.
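
The Dutch equivalent would be to point the snowball dictionary at the same
file that was given as StopFile to the Dutch ispell dictionary. A sketch only:
'nl_stem' and the path below are hypothetical placeholders, not names from
your setup.

```sql
-- Assumptions: 'nl_stem' stands for the Dutch snowball dictionary, and the
-- path stands for the StopFile used when registering the Dutch ispell
-- dictionary.
update pg_ts_dict
   set dict_initoption = '/path/to/dutch.stop'
 where dict_name = 'nl_stem';
```

Skipping the step should do no harm; it only means the two dictionaries may
disagree about which words are stop words.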

>
>> -- default token<->dicts mappings
>> insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from
>> public.pg_ts_cfgmap where ts_name='default';
>>
>> -- modify mappings for latin words for configuration 'pg'
>> update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}'
>> where tok_alias in ( 'lword', 'lhword', 'lpart_hword' )
>> and ts_name = 'pg';
>>
>> -- we won't index/search some tokens
>> update pg_ts_cfgmap set dict_name = NULL
>> --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word')
>> where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float')
>> and ts_name = 'pg';
>>
>> end;
>>
>> -- testing
>>
>> select * from ts_debug('
>> PostgreSQL, the highly scalable, SQL compliant, open source
>> object-relational
>> database management system, is now undergoing beta testing of the next
>> version of our software: PostgreSQL 8.2.
>> ');
>>
>>
>> Oleg
>
>
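
For a Dutch configuration, the mapping steps from the example might be adapted
roughly as follows. This is a sketch under assumptions: 'nederlands' is the
configuration name from the query above, while 'nl_ispell' and 'nl_stem' are
hypothetical names for the Dutch ispell and snowball dictionaries.

```sql
-- Copy the default token<->dictionary mappings into the Dutch configuration.
insert into pg_ts_cfgmap
select 'nederlands', tok_alias, dict_name
  from pg_ts_cfgmap
 where ts_name = 'default';

-- Send latin words through ispell first, then the snowball stemmer.
-- 'nl_ispell' and 'nl_stem' are placeholder dictionary names.
update pg_ts_cfgmap
   set dict_name = '{nl_ispell,nl_stem}'
 where tok_alias in ('lword', 'lhword', 'lpart_hword')
   and ts_name = 'nederlands';
```

The order in the array matters in the same way as in the 'pg' example: the
stemmer is listed last, after the more specific dictionary.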

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
