From: Alexander Presber <aljoscha(at)weisshuhn(dot)de>
To: pgsql-general(at)postgresql(dot)org
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Henning Spjelkavik <henning(at)spjelkavik(dot)net>
Subject: Re: TSearch2 / German compound words / UTF-8
Date: 2006-02-17 14:36:45
Message-ID: 7C945F17-1564-4232-BADE-F61D9D7395F2@weisshuhn.de
Lists: pgsql-general
Hello,
Thanks for your efforts; I still can't get it to work.
I now tried the Norwegian example. My encoding is ISO-8859 (I never
used UTF-8 because I thought it would be slower; the thread name is
a bit misleading).
So I am using an ISO-8859-15 (LATIN9) database:
~/cvs/ssd% psql -l
Name | Owner | Encoding
-----------+------------+-----------
postgres | postgres | LATIN9
tstest | aljoscha | LATIN9
and a Norwegian, ISO-8859-encoded dictionary and aff-file:
~% file tsearch/dict/ispell_no/norwegian.dict
tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
~% file tsearch/dict/ispell_no/norwegian.aff
tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text
the aff-file contains the lines:
compoundwords controlled z
...
# to compounds only:
flag ~\\:
[^S] > S
and the dictionary contains:
overtrekk/BCW\z
(meaning: the word can be a compound part, and a linking "s" is allowed)
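To make sure I read those two rules correctly, here is a toy sketch of the intended behavior (plain Python, nothing to do with ispell's real affix machinery; the word list and the splitter are invented purely for illustration):

```python
# Toy illustration of the aff-file rules above (NOT ispell's real algorithm):
# "compoundwords controlled z" marks words carrying flag z as compound parts,
# and the "[^S] > S" rule lets such a part take a linking "s" inside compounds.
compound_parts = {"overtrekk", "grill"}  # stand-in for the flagged dictionary words

def split_compound(word):
    """Try to split `word` into two known parts, allowing a linking 's'."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if tail in compound_parts:
            if head in compound_parts:
                return [head, tail]
            # linking "s": the head minus its trailing "s" must be a known part
            if head.endswith("s") and head[:-1] in compound_parts:
                return [head[:-1], tail]
    return None

print(split_compound("overtrekkgrill"))   # ['overtrekk', 'grill']
print(split_compound("overtrekksgrill"))  # ['overtrekk', 'grill'] -- the "s" is only the joint
```

So both forms should lexize to the same parts; in my installation only the form without the "s" does.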
My configuration is:
tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
ts_name | prs_name | locale
-----------+----------+------------
simple | default | de_DE(at)euro
german | default | de_DE(at)euro
norwegian | default | de_DE(at)euro
Now the test:
tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
lexize
--------
(1 row)
BUT:
tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
lexize
------------------------------------
{over,trekk,grill,overtrekk,grill}
(1 row)
It simply doesn't work. No UTF-8 is involved.
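One sanity check I can offer, since encoding mismatches are the usual suspect: a file that really is LATIN9/ISO-8859-15 and contains Norwegian characters (ø, æ, å) will not decode as UTF-8, so a successful UTF-8 decode suggests the file is not plain Latin text. A small sketch (the helper name is mine, not part of tsearch2):

```python
# Heuristic: single-byte Latin-9 text containing ø/æ/å is invalid as UTF-8,
# so a clean UTF-8 decode hints the file is NOT ISO-8859-15 after all.
def looks_like_utf8(path):
    try:
        with open(path, "rb") as f:
            f.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# e.g. looks_like_utf8("tsearch/dict/ispell_no/norwegian.dict")
# False would be consistent with an ISO-8859-15 file containing ø/æ/å.
```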
Sincerely yours,
Alexander Presber
P.S.: Henning: Sorry for bothering you with the CC; just ignore it
if you like.
On 27.01.2006 at 18:17, Teodor Sigaev wrote:
> contrib_regression=# insert into pg_ts_dict values (
> 'norwegian_ispell',
> (select dict_init from pg_ts_dict where
> dict_name='ispell_template'),
> 'DictFile="/usr/local/share/ispell/norsk.dict" ,'
> 'AffFile ="/usr/local/share/ispell/norsk.aff"',
> (select dict_lexize from pg_ts_dict where
> dict_name='ispell_template'),
> 'Norwegian ISpell dictionary'
> );
> INSERT 16681 1
> contrib_regression=# select lexize('norwegian_ispell','politimester');
> lexize
> ------------------------------------------
> {politimester,politi,mester,politi,mest}
> (1 row)
>
> contrib_regression=# select lexize
> ('norwegian_ispell','sjokoladefabrikk');
> lexize
> --------------------------------------
> {sjokoladefabrikk,sjokolade,fabrikk}
> (1 row)
>
> contrib_regression=# select lexize
> ('norwegian_ispell','overtrekksgrilldresser');
> lexize
> -------------------------
> {overtrekk,grill,dress}
> (1 row)
> % psql -l
> List of databases
> Name | Owner | Encoding
> --------------------+--------+----------
> contrib_regression | teodor | KOI8
> postgres | pgsql | KOI8
> template0 | pgsql | KOI8
> template1 | pgsql | KOI8
> (4 rows)
>
>
> I'm afraid that's the UTF-8 problem. We just committed multibyte
> support for tsearch2 to CVS HEAD, so you can try it.
>
> Please note: the dict, aff, and stopword files should be in the server
> encoding. Snowball sources for German (and other languages) in UTF-8
> can be found at http://snowball.tartarus.org/dist/libstemmer_c.tgz
>
> To all: Maybe we should put all of Snowball's stemmers (for all
> available languages and encodings) into the tsearch2 directory?
>
> --
> Teodor Sigaev      E-mail: teodor(at)sigaev(dot)ru
>                    WWW: http://www.sigaev.ru/