Re: TSearch2 / German compound words / UTF-8

From: Alexander Presber <aljoscha(at)weisshuhn(dot)de>
To: pgsql-general(at)postgresql(dot)org
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Henning Spjelkavik <henning(at)spjelkavik(dot)net>
Subject: Re: TSearch2 / German compound words / UTF-8
Date: 2006-02-17 14:36:45
Message-ID: 7C945F17-1564-4232-BADE-F61D9D7395F2@weisshuhn.de
Lists: pgsql-general

Hello,

Thanks for your efforts; I still can't get it to work.
I have now tried the Norwegian example. My encoding is ISO-8859 (I
have never used UTF-8 because I thought it would be slower, so the
thread name is a bit misleading).

So I am using a LATIN9 (ISO-8859-15) database:

~/cvs/ssd% psql -l

Name | Eigentümer | Kodierung
-----------+------------+-----------
postgres | postgres | LATIN9
tstest | aljoscha | LATIN9

and a Norwegian, ISO-8859-encoded dictionary and aff file:

~% file tsearch/dict/ispell_no/norwegian.dict
tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
~% file tsearch/dict/ispell_no/norwegian.aff
tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text

The aff file contains the lines:

compoundwords controlled z
...
# to compounds only:
flag ~\\:
[^S] > S

and the dictionary contains:

overtrekk/BCW\z

(meaning: the word can be a compound part, and an intermediary "s" is
allowed)
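For illustration, here is a rough sketch of the behavior I expect from
the flag above: compound words may be split into dictionary parts, with
an optional linking "s" permitted between parts. This is a simplified
toy, not the actual ispell compound algorithm, and all names in it are
made up:

```python
def split_compound(word, parts, allow_s=True):
    """Greedily split `word` into known dictionary `parts`.

    An optional intermediary "s" between two parts is accepted,
    mimicking what the aff-file flag is supposed to allow.
    Returns a list of parts, or None if no split covers the word.
    """
    if word == "":
        return []
    for p in parts:
        if word.startswith(p):
            rest = word[len(p):]
            # First try to continue without a linking "s" ...
            tail = split_compound(rest, parts, allow_s)
            if tail is not None:
                return [p] + tail
            # ... then retry after consuming an intermediary "s".
            if allow_s and rest.startswith("s"):
                tail = split_compound(rest[1:], parts, allow_s)
                if tail is not None:
                    return [p] + tail
    return None

parts = {"overtrekk", "grill"}
print(split_compound("overtrekksgrill", parts))  # -> ['overtrekk', 'grill']
print(split_compound("overtrekkgrill", parts))   # -> ['overtrekk', 'grill']
```

Both forms should split; in my setup only the one without the linking
"s" does.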

My configuration is:

tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
ts_name | prs_name | locale
-----------+----------+------------
simple | default | de_DE(at)euro
german | default | de_DE(at)euro
norwegian | default | de_DE(at)euro

Now the test:

tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
lexize
--------

(1 Zeile)

BUT:

tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
lexize
------------------------------------
{over,trekk,grill,overtrekk,grill}
(1 Zeile)

So the compound split with the intermediary "s" simply doesn't work,
and no UTF-8 is involved.

Sincerely yours,

Alexander Presber

P.S.: Henning: sorry for bothering you with the CC; just ignore it if
you like.

On 27.01.2006 at 18:17, Teodor Sigaev wrote:

> contrib_regression=# insert into pg_ts_dict values (
> 'norwegian_ispell',
> (select dict_init from pg_ts_dict where
> dict_name='ispell_template'),
> 'DictFile="/usr/local/share/ispell/norsk.dict" ,'
> 'AffFile ="/usr/local/share/ispell/norsk.aff"',
> (select dict_lexize from pg_ts_dict where
> dict_name='ispell_template'),
> 'Norwegian ISpell dictionary'
> );
> INSERT 16681 1
> contrib_regression=# select lexize('norwegian_ispell','politimester');
> lexize
> ------------------------------------------
> {politimester,politi,mester,politi,mest}
> (1 row)
>
> contrib_regression=# select lexize
> ('norwegian_ispell','sjokoladefabrikk');
> lexize
> --------------------------------------
> {sjokoladefabrikk,sjokolade,fabrikk}
> (1 row)
>
> contrib_regression=# select lexize
> ('norwegian_ispell','overtrekksgrilldresser');
> lexize
> -------------------------
> {overtrekk,grill,dress}
> (1 row)
> % psql -l
> List of databases
> Name | Owner | Encoding
> --------------------+--------+----------
> contrib_regression | teodor | KOI8
> postgres | pgsql | KOI8
> template0 | pgsql | KOI8
> template1 | pgsql | KOI8
> (4 rows)
>
>
> I'm afraid that's a UTF-8 problem. We just committed multibyte
> support for tsearch2 to CVS HEAD, so you can try it.
>
> Please note: the dict, aff, and stopword files should be in the
> server encoding. Snowball sources for German (and other languages)
> in UTF-8 can be found at http://snowball.tartarus.org/dist/libstemmer_c.tgz
>
> To all: maybe we should put all of Snowball's stemmers (for all
> available languages and encodings) into the tsearch2 directory?
>
> --
> Teodor Sigaev                E-mail: teodor(at)sigaev(dot)ru
>                              WWW: http://www.sigaev.ru/
