Re: tsearch2, ispell, utf-8 and german special characters

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: <pgsql-general(at)postgresql(dot)org>, <openfts-general(at)lists(dot)sourceforge(dot)net>
Subject: Re: tsearch2, ispell, utf-8 and german special characters
Date: 2004-07-21 12:23:49
Message-ID: 2266D0630E43BB4290742247C891057505BF2E79@dozer.computec.de
Lists: pgsql-general

Hi!

Okay, I changed locale via initdb and I've got it working to some extent now.

Now I've got a problem with the ISpell dictionary and the stopword list. Both have been compiled with the de_DE.utf8 locale.

When I run
SELECT to_tsvector('default_german',
'Jeden Tag wirst Du ein bisschen älter, aber Du lernst');

I get
'tag':2 'aber':8 'eint':5 'lernen':10 'älter':7 'bisschen':6

I've got three questions regarding this result:
1. Both 'ein' and 'aber' are included in the stopword file, but they show up in the result, whereas 'jeden', 'wirst' and 'du' are removed correctly. Why is the stopword list ignored for the former two?
2. Why does 'ein' appear as 'eint'?
3. Is this result actually no cause for alarm, so that I can deploy tsearch2 to my production databases nevertheless?

I'm using http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2 (the latest version of Heinz Knutzen's dictionary), and I've edited its Makefile to use de_DE.utf8 in the locale settings. all.words was indeed the file used to generate the hash, so I'm fairly sure that I've followed the instructions in the docs precisely. I dropped all references to the German Snowball stemmer dictionary, which I had configured as a fallback, so currently this is the only dictionary configured for ts_name default_german and tok_alias lhword, lpart_hword, lword (the remaining tok_alias entries are set to use the simple dictionary).
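
The setup described above can be cross-checked from SQL. The following is only a minimal sketch, assuming the standard tsearch2 catalog tables pg_ts_cfgmap and pg_ts_dict and the lexize() function. The dictionary name 'de_ispell' is a placeholder, since the actual name is not given in this message; substitute whatever pg_ts_dict lists.

```sql
-- Which dictionaries handle which token types for this configuration?
SELECT tok_alias, dict_name
FROM pg_ts_cfgmap
WHERE ts_name = 'default_german'
ORDER BY tok_alias;

-- How are the dictionaries registered? For an ISpell dictionary the
-- init option string should name the hash, affix and stopword files.
SELECT dict_name, dict_initoption
FROM pg_ts_dict;

-- Ask the dictionary directly what it does with single words.
-- 'de_ispell' is a hypothetical name (assumption, see above).
SELECT lexize('de_ispell', 'ein')  AS ein,
       lexize('de_ispell', 'aber') AS aber,
       lexize('de_ispell', 'du')   AS du;
```

Comparing the lexize() output for a word that is removed correctly ('du') with the two that are not ('ein', 'aber') should help narrow down whether the dictionary files or the token-type mapping are responsible.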

Kind regards

Markus

> -----Original Message-----
> From: Peter Eisentraut [mailto:peter_e(at)gmx(dot)net]
> Sent: Wednesday, 21 July 2004 12:17
> To: Markus Wollny
> Cc: pgsql-general(at)postgresql(dot)org; openfts-general(at)lists(dot)sourceforge(dot)net
> Subject: Re: AW: [GENERAL] tsearch2, ispell, utf-8 and german special characters
>
> On Wednesday, 21 July 2004 09:36, Markus Wollny wrote:
> > Thanks for your answer. It's probably not sufficient to adjust the
> > current locale settings of the system, so I'll have to dump, re-initdb
> > and reload - am I correct or is there some procedure involving less
> > downtime than that?
>
> Sorry, no.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
>
