Re: [tsvector] to_tsvector called multiple times

From: "Sven R(dot) Kunze" <srkunze(at)tbz-pariv(dot)de>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: [tsvector] to_tsvector called multiple times
Date: 2015-05-26 09:47:43
Message-ID: 5564413F.605@tbz-pariv.de
Lists: pgsql-general

Thanks Albe for that detailed answer.

On 26.05.2015 11:01, Albe Laurenz wrote:
> Sven R. Kunze wrote:
>> the following stemming results made me curious:
>>
>> select to_tsvector('german', 'systeme'); > 'system':1
>> select to_tsvector('german', 'systemes'); > 'system':1
>> select to_tsvector('german', 'systems'); > 'system':1
>> select to_tsvector('german', 'systemen'); > 'system':1
>> select to_tsvector('german', 'system'); > 'syst':1
>>
>>
>> First of all, this seems to be a bug in the German stemmer. Where can I
>> fix it?
> As far as I understand, the stemmer is not perfect, it is just a "best
> effort" at German stemming. It does not have a dictionary of valid German
> words, but uses an algorithm based on only the occurring letters.
>
> This web page describes the algorithm:
> http://snowball.tartarus.org/algorithms/german/stemmer.html
> I guess that the Snowball folks (and PostgreSQL) would be interested
> if you could come up with a better algorithm.

Thanks for that hint. I will go to
https://github.com/snowballstem/snowball/issues and try to explain my
problem there.

However, are you sure I am using Snowball? Maybe I am reading the
documentation wrong:
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems to depend on which packages (ispell, hunspell, myspell,
snowball + corresponding languages) are installed on my system.

Is there an easy way to determine which of these packages PostgreSQL
uses AND what for?
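
One possible way to check this, as a sketch only: ts_debug() reports which
dictionaries are consulted for each token, and the catalogs pg_ts_dict and
pg_ts_template show which template backs each dictionary. The sample words
are taken from the examples above; what the queries return depends on the
local installation.

```sql
-- Which dictionaries does the 'german' configuration try for each token,
-- and which one actually produced the lexeme?
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'system systeme');

-- Which template (snowball, ispell, simple, ...) backs each dictionary?
SELECT d.dictname, t.tmplname, d.dictinitoption
FROM pg_ts_dict d
JOIN pg_ts_template t ON t.oid = d.dicttemplate
ORDER BY d.dictname;

-- In psql, \dF+ german prints the token-type-to-dictionary mapping as well.
```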

> In this specific case, the stemmer goes wrong because "System" is a
> foreign word whose ending is atypical for German. The algorithm cannot
> distinguish between "System" and, say, "lautem" or "bestem".
>
>> Second, and more importantly, as I understand it, the stemmed version of
>> a word should be considered normalized. That is, all other versions of
>> that stem should be mapped to it as well. The interesting problem here
>> is that PostgreSQL maps the stem itself ('system') to a completely
>> different stem ('syst').
>>
>> Should a stem not remain stable even when to_tsvector is called on it
>> multiple times?
> That's a possible position, but consider that a stem is not necessarily
> a valid German word. If you treat it as a German word (by stemming it),
> the results might not be what you desire.
>
> For example:
>
> test=> select to_tsvector('german', 'linsen');
> to_tsvector
> -------------
> 'lins':1
> (1 row)
>
> test=> select to_tsvector('german', 'lins');
> to_tsvector
> -------------
> 'lin':1
> (1 row)

Sure. That might be the problem. It occurs to me that stems (if detected
as such) should be left alone.
If a stem is a real German word, it should be stemmed to itself anyway.
If not, it might help not to stem it at all in order to avoid errors.
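
The instability can be made visible by stemming each word twice. This is a
minimal sketch that assumes the 'german' configuration uses the german_stem
Snowball dictionary, which may not hold on every installation; the word list
is only the examples from this thread.

```sql
-- Assumption: 'german_stem' is the dictionary behind the 'german' configuration.
-- ts_lexize() returns text[], so [1] picks the first lexeme (NULL if none).
SELECT word,
       (ts_lexize('german_stem', word))[1] AS stem,
       (ts_lexize('german_stem',
                  (ts_lexize('german_stem', word))[1]))[1] AS stem_of_stem
FROM unnest(ARRAY['systeme', 'system', 'linsen']) AS t(word);
```

Rows where stem and stem_of_stem differ are the cases where the stem is not
stable under repeated stemming.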

> I guess that your real problem here is that a search for "system"
> will not find "systeme", which is indeed unfortunate.
> But until somebody can come up with a better stemming algorithm, cases
> like that can always occur.
>
> Yours,
> Laurenz Albe

This might pose a problem in the future, of course. Thanks for pointing
this out as well.

Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze(at)tbz-pariv(dot)de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
