Re: Tsearch2 Dutch snowball stemmer in PG8.1

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Alban Hertroys <a(dot)hertroys(at)magproductions(dot)nl>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Tsearch2 Dutch snowball stemmer in PG8.1
Date: 2007-10-03 12:32:55
Message-ID: Pine.LNX.4.64.0710031630410.3304@sn.sai.msu.ru
Lists: pgsql-general

Alban,

the documentation you're referring to is for the upcoming 8.3 release.
For 8.1 and 8.2 you need to set up all the machinery by hand. It's not
difficult, for example:

-- sample tsearch2 configuration for search.postgresql.org
-- Creates configuration 'pg' as the default; its locale must match the server's locale!
-- Change 'ru_RU.UTF-8' below to your server's locale

begin;

-- create special (default) configuration 'pg'
update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8';
insert into pg_ts_cfg values('pg','default','ru_RU.UTF-8');

-- register 'pg_dict' dictionary using synonym template
-- postgres pg
-- pgsql pg
-- postgresql pg
insert into pg_ts_dict
(select 'pg_dict',dict_init,
'/usr/local/pgsql-dev/share/contrib/pg_dict.txt',
dict_lexize, 'pg-specific dictionary'
from pg_ts_dict
where dict_name='synonym'
);

-- register ispell dictionary, check paths and stop words
-- I used iconv on the English files, since they contain some Cyrillic characters
insert into pg_ts_dict
(SELECT 'en_ispell', dict_init,
'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",'
'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",'
'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'ispell_template'
);

-- set the stop-word list for 'en_stem' (check the path; ideally the same list as for 'en_ispell')
UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop'
where dict_name='en_stem';

-- default token<->dicts mappings
insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from public.pg_ts_cfgmap where ts_name='default';

-- modify mappings for latin words for configuration 'pg'
update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}'
where tok_alias in ( 'lword', 'lhword', 'lpart_hword' )
and ts_name = 'pg';

-- we won't index/search some tokens
update pg_ts_cfgmap set dict_name = NULL
--where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word')
where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float')
and ts_name = 'pg';

end;
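
The rows written by the script can be inspected before testing. This is a minimal sketch, assuming the tsearch2 contrib tables are on the current search_path and that tsearch2 compares the locale column against the server's LC_CTYPE setting (that last point is an assumption worth verifying on your installation):

```sql
-- server locale that the 'pg' configuration is expected to match
-- (assumption: tsearch2 looks at LC_CTYPE)
show lc_ctype;

-- the configuration row created above
select ts_name, prs_name, locale from pg_ts_cfg where ts_name = 'pg';

-- token type -> dictionary list for 'pg';
-- a NULL dict_name means that token type is not indexed
select tok_alias, dict_name
from pg_ts_cfgmap
where ts_name = 'pg'
order by tok_alias;
```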

-- testing

select * from ts_debug('
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
database management system, is now undergoing beta testing of the next
version of our software: PostgreSQL 8.2.
');
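
Individual dictionaries can also be checked one at a time, which helps when a file path in dict_initoption is wrong. This is a sketch using the tsearch2 functions lexize(), to_tsvector() and to_tsquery() with the names registered above; the sample words and sentence are only illustrations:

```sql
-- ask each dictionary directly what it returns for a word
select lexize('pg_dict', 'postgresql');
select lexize('en_ispell', 'testing');
select lexize('en_stem', 'testing');

-- run the whole 'pg' configuration over a string
select to_tsvector('pg', 'PostgreSQL is now undergoing beta testing');

-- match a query against it using the same configuration
select to_tsvector('pg', 'PostgreSQL is now undergoing beta testing')
       @@ to_tsquery('pg', 'postgres');
```

If a dictionary cannot load its files, the lexize() call for that dictionary is the one that should fail, which narrows the problem down to a single row in pg_ts_dict.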

Oleg
On Wed, 3 Oct 2007, Alban Hertroys wrote:

> Hello,
>
> I'm trying to get a Dutch snowball stemmer in Postgres 8.1, but I can't
> find how to do that.
>
> I found CREATE FULLTEXT DICTIONARY commands in the tsearch2 docs on
> http://www.sai.msu.su/~megera/postgres/fts/doc/index.html, but these
> commands are apparently not available on PG8.1.
>
> I also found the tables pg_ts_(cfg|cfgmap|dict|parser), but I have no
> idea how to add a Dutch stemmer to those.
>
> I did find some references to stem.[ch] files that were suggested to
> compile into the postgres sources, but I cannot believe that's the right
> way to do this (besides that I don't have sufficient privileges to
> install such a version).
>
> So... How do I do this?
>
> The system involved is some version of Debian Linux (2.6 kernel); are
> there any packages for a Dutch stemmer maybe?
>
> I'm in a bit of a hurry too, as we're on a tight deadline :(
>
> Regards,
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
