From: | Hannes Dorbath <light(at)theendofthetunnel(dot)de> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | TSearch2 / German compound words / UTF-8 |
Date: | 2005-11-23 09:57:34 |
Message-ID: | dm1ece$2gb5$1@news.hub.org |
Lists: | pgsql-general |
Hi,
I'm on PG 8.0.4, initDB and locale set to de_DE.UTF-8, FreeBSD.
My TSearch config is based on "Tsearch2 and Unicode/UTF-8" by Markus
Wollny (http://tinyurl.com/a6po4)
The following files are used:
http://hannes.imos.net/german.med [UTF-8]
http://hannes.imos.net/german.aff [ANSI]
http://hannes.imos.net/german.stop [UTF-8]
http://hannes.imos.net/german.stop.ispell [UTF-8]
german.med is from "ispell-german-compound.tar.gz", available on the
TSearch2 site, recoded to UTF-8.
The first problem is with German compound words and has nothing to do
with UTF-8:
In German, an "s" is often used to "link" two words into a compound
word. This is true for many German compound words. TSearch/ispell is not
able to break those words up; only exact matches work.
An example with "Produktionsintervall" (production interval):
fts=# SELECT ts_debug('Produktionsintervall');
ts_debug
--------------------------------------------------------------------------------------------------
(default_german,lword,"Latin
word",Produktionsintervall,"{de_ispell,de}",'produktionsintervall')
TSearch/ispell is not able to break this word into parts because of the
"s" in "Produktion/s/intervall". Misspelling the word as
"Produktionintervall" fixes it:
fts=# SELECT ts_debug('Produktionintervall');
ts_debug
---------------------------------------------------------------------------------------------------------------------
(default_german,lword,"Latin
word",Produktionintervall,"{de_ispell,de}","'ion' 'produkt' 'intervall'
'produktion'")
How can I fix this / get TSearch to remove/stem the last "s" on a word
before (re-)searching the dict? Can I modify my dict or hack something
else? This is a bit of a show stopper :/
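To make the behaviour I am after concrete, here is a minimal Python
sketch. It is not TSearch2 code: the tiny word list and the names WORDS,
LINKERS and split_compound are made up for illustration. It splits a
compound into dictionary words and optionally skips a linking "s"
between the parts:

```python
# Hypothetical mini-dictionary; the real german.med is far larger.
WORDS = {"produktion", "intervall"}
# Allowed linking elements between parts: none, or the linking "s".
LINKERS = ("", "s")


def split_compound(word, words=WORDS, linkers=LINKERS):
    """Return the dictionary parts of `word`, or None if no split exists."""
    word = word.lower()
    if word in words:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in words:
            continue
        rest = word[i:]
        for link in linkers:
            if not rest.startswith(link):
                continue
            # Skip the linking element and split what remains.
            tail = split_compound(rest[len(link):], words, linkers)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("Produktionsintervall"))
print(split_compound("Produktionsintervall", linkers=("",)))
```

With the "s" allowed as a linking element the word splits into
"produktion" and "intervall"; with linkers=("",) no split is found,
which is what I see from ts_debug today.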
The second thing is with UTF-8:
I know there is no support, or no full support, yet, but I need to get
it as good as possible /now/. Is there anything in CVS that I might be
able to backport to my version, or are there any other tips? My setup
works as far as the dict and the stop word files are concerned, but I
fear the stemming and mapping of umlauts and other special chars does
not work as it should. I tried recoding german.aff to UTF-8 as well, but
that sometimes breaks it with a regex error:
fts=# SELECT ts_debug('dass');
ERROR: Regex error in '[^sãŸ]$': brackets [] not balanced
CONTEXT: SQL function "ts_debug" statement 1
This seems to happen while it tries to map ss to ß, but anyway, I fear I
didn't achieve anything good with that.
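One possible explanation, which is only my guess and not verified
against the TSearch2 source: the pattern from the recoded german.aff is
lowercased byte by byte as if it were Latin-1. The Python sketch below
(the function name lower_bytewise_latin1 is made up) shows that this
would turn the two UTF-8 bytes of "ß" into exactly the characters seen
in the error message, and that the result is no longer valid UTF-8:

```python
def lower_bytewise_latin1(data: bytes) -> bytes:
    """Lowercase each byte as if it were a Latin-1 character."""
    return data.decode("latin-1").lower().encode("latin-1")


sharp_s = "ß".encode("utf-8")           # two bytes: 0xC3 0x9F
lowered = lower_bytewise_latin1(sharp_s)  # 0xC3 becomes 0xE3, 0x9F stays

print(sharp_s, lowered)
# Shown in a Windows-1252 terminal, the lowered bytes read as in the error.
print(lowered.decode("cp1252"))

pattern = b"[^s" + lowered + b"]$"
try:
    pattern.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # 0xE3 announces a three-byte sequence, but "]" is not a continuation byte.
    still_valid = False
print(still_valid)
```

If a multibyte-aware regex parser then reads the 0xE3 lead byte together
with the following bytes as one character, the closing "]" could get
swallowed, which would fit the "brackets [] not balanced" message.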
As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second
snowball dict. The first lines of the stem.h I used start with:
> extern struct SN_env * german_ISO_8859_1_create_env(void);
So I guess this will not work exactly well with UTF-8 ;p Is there any
other stem.h I could use? Google hasn't returned much for me :/
Thanks for reading and for all your time. I'll consider the donate
button after I get this working ;/
--
Regards,
Hannes Dorbath