From: | Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> |
---|---|
To: | "'Sven R(dot) Kunze *EXTERN*'" <srkunze(at)tbz-pariv(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: [tsvector] to_tsvector called multiple times |
Date: | 2015-05-26 09:01:44 |
Message-ID: | A737B7A37273E048B164557ADEF4A58B36615EEF@ntex2010i.host.magwien.gv.at |
Lists: | pgsql-general |
Sven R. Kunze wrote:
> the following stemming results made me curious:
>
> select to_tsvector('german', 'systeme');
> 'system':1
> select to_tsvector('german', 'systemes');
> 'system':1
> select to_tsvector('german', 'systems');
> 'system':1
> select to_tsvector('german', 'systemen');
> 'system':1
> select to_tsvector('german', 'system');
> 'syst':1
>
>
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?
As far as I understand, the stemmer is not perfect; it is just a "best
effort" at German stemming. It does not have a dictionary of valid German
words, but uses an algorithm based only on the letters that occur in the word.
This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.
In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German. The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem".
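One way to compare these words on your own server is to pass them to the stemming dictionary directly. This is only a sketch: it assumes the built-in "german_stem" Snowball dictionary (the one used by the stock "german" configuration) is present, and the output may differ between PostgreSQL versions, since the bundled Snowball code changes over time.

```sql
-- ts_lexize() shows what a single dictionary produces for a single token.
-- 'german_stem' is assumed to be the built-in Snowball dictionary.
SELECT word,
       ts_lexize('german_stem', word) AS lexemes
FROM unnest(ARRAY['system', 'lautem', 'bestem']) AS word;
```

If the "em" ending is stripped from all three words, the stemmer is treating "System" like an inflected form such as "lautem".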
> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
>
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?
That's a possible position, but consider that a stem is not necessarily
a valid German word. If you treat it as a German word (by stemming it),
the results might not be what you desire.
For example:
test=> select to_tsvector('german', 'linsen');
 to_tsvector
-------------
 'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
 to_tsvector
-------------
 'lin':1
(1 row)
I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.
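The search problem can be checked with a query along these lines. It is a sketch that assumes the stock "german" configuration; on a server that stems as shown earlier in this thread, the two sides normalize to different lexemes and the match should come out false, but newer versions may behave differently.

```sql
-- Both the document and the query are normalized with the same configuration.
-- The document side stems 'systeme', the query side stems 'system'.
SELECT to_tsvector('german', 'systeme') AS document,
       to_tsquery('german', 'system')   AS query,
       to_tsvector('german', 'systeme') @@ to_tsquery('german', 'system') AS matches;
```

Comparing the "document" and "query" columns shows directly whether the two lexemes agree.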
Yours,
Laurenz Albe