Re: tsearch2, ispell, utf-8 and german special characters

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Markus Wollny <Markus(dot)Wollny(at)computec(dot)de>
Cc: pgsql-general(at)postgresql(dot)org, openfts-general(at)lists(dot)sourceforge(dot)net
Subject: Re: tsearch2, ispell, utf-8 and german special characters
Date: 2004-07-21 13:34:01
Message-ID: Pine.GSO.4.58.0407211715340.29036@ra.sai.msu.su
Lists: pgsql-general

Markus,

it would be easier for others to help if you showed your tsearch2 configuration.
Also, which versions of pgsql and tsearch2 are you using (any patches applied?)
Since I don't know German I can offer only a little help, but I'd like
to hear from you once you get everything working right,
so other people can benefit from your experience.

I wouldn't use tsearch2 in production until you understand your problem and
get tsearch2 working correctly.
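
For anyone following along, the configuration asked for above could be collected with queries along these lines. This is only a sketch: it assumes the stock tsearch2 contrib tables (pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict) and the lexize() function, and the dictionary name 'de_ispell' is a placeholder, not something taken from Markus's setup.

```sql
-- Which parser and locale the configuration is bound to
SELECT ts_name, prs_name, locale
FROM pg_ts_cfg
WHERE ts_name = 'default_german';

-- Which dictionaries handle each token type
SELECT tok_alias, dict_name
FROM pg_ts_cfgmap
WHERE ts_name = 'default_german'
ORDER BY tok_alias;

-- Dictionary definitions, including their init options (file paths)
SELECT dict_name, dict_initoption
FROM pg_ts_dict;

-- What one dictionary does with a single word
-- ('de_ispell' is a hypothetical dictionary name; substitute your own)
SELECT lexize('de_ispell', 'ein');
```

The last query is a quick way to see how a particular dictionary treats one word in isolation, separately from the parser and the rest of the configuration.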

Oleg

On Wed, 21 Jul 2004, Markus Wollny wrote:

> Hi!
>
> Okay, I changed locale via initdb and I've got it working to some extent now.
>
> Now I've got some problem with the ISpell-dictionary and the stopwords-list. Both have been compiled with de_DE.utf8-locale.
>
> When I
> SELECT to_tsvector('default_german',
> 'Jeden Tag wirst Du ein bisschen älter, aber Du lernst');
>
> I get
> 'tag':2 'aber':8 'eint':5 'lernen':10 'älter':7 'bisschen':6
>
> I've got three questions regarding this result:
> 1. both 'ein' and 'aber' are included in the stopwords-file, but they show up in the result, whereas 'jeden', 'wirst', 'du' are removed correctly - why is the stopword-list ignored for the former two?
> 2. why does 'ein' appear as 'eint'?
> 3. is this result actually no cause for alarm, so that I can deploy tsearch2 to my production databases nevertheless?
>
> I'm using http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2 (the latest version of Heinz Knutzen's dictionary) and I've edited its Makefile to use de_DE.utf8 in the locale settings; all.words was indeed the file used to generate the hash, so I guess that I can now be more or less sure that I've actually followed the instructions in the docs precisely. I dropped any references to the German snowball stemmer dictionary which I had configured as a fallback, so currently there's only this one dictionary configured for ts_name default_german and tok_alias lhword, lpart_hword, lword (the remaining tok_alias entries are set to use the simple dictionary).
>
> Kind regards
>
> Markus
>
> > -----Original Message-----
> > From: Peter Eisentraut [mailto:peter_e(at)gmx(dot)net]
> > Sent: Wednesday, 21 July 2004 12:17
> > To: Markus Wollny
> > Cc: pgsql-general(at)postgresql(dot)org;
> > openfts-general(at)lists(dot)sourceforge(dot)net
> > Subject: Re: AW: [GENERAL] tsearch2, ispell, utf-8 and german
> > special characters
> >
> > On Wednesday, 21 July 2004 09:36, Markus Wollny wrote:
> > > Thanks for your answer. It's probably not sufficient to adjust the
> > > current locale settings of the system, so I'll have to dump,
> > > re-initdb and reload - am I correct or is there some procedure
> > > involving less downtime than that?
> >
> > Sorry, no.
> >
> > --
> > Peter Eisentraut
> > http://developer.postgresql.org/~petere/
> >
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
