Re: [tsvector] to_tsvector called multiple times

From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "'Sven R(dot) Kunze *EXTERN*'" <srkunze(at)tbz-pariv(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: [tsvector] to_tsvector called multiple times
Date: 2015-05-26 09:01:44
Message-ID: A737B7A37273E048B164557ADEF4A58B36615EEF@ntex2010i.host.magwien.gv.at
Lists: pgsql-general

Sven R. Kunze wrote:
> the following stemming results made me curious:
>
> select to_tsvector('german', 'systeme');
> 'system':1
> select to_tsvector('german', 'systemes');
> 'system':1
> select to_tsvector('german', 'systems');
> 'system':1
> select to_tsvector('german', 'systemen');
> 'system':1
> select to_tsvector('german', 'system');
> 'syst':1
>
>
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?

As far as I understand, the stemmer is not perfect; it is just a "best
effort" at German stemming. It does not have a dictionary of valid German
words, but uses an algorithm based only on the letters that occur in the word.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German. The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem".
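
To make that concrete, here is a rough Python sketch of only the first
suffix-removal step as I read it on the Snowball page above. It is not the
full stemmer and not the code PostgreSQL runs: it skips the preprocessing
(ß, u/y between vowels), the "niss" special case and the later steps, and
the function names are made up for illustration.

```python
# Simplified sketch of step 1 of the Snowball German stemmer (assumption:
# based on my reading of the published description, not on the real source).
VOWELS = set("aeiouyäöü")
S_ENDINGS = set("bdfghklmnrt")  # letters that may precede a removable "s"


def r1_start(word):
    """Index where region R1 begins: after the first non-vowel that follows
    a vowel, moved right so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no such non-vowel: R1 is empty


def strip_step1(word):
    """Delete the longest matching step-1 suffix if it lies inside R1."""
    word = word.lower()
    r1 = r1_start(word)
    # Ordered by length, so the first match is the longest one.
    for suffix in ("ern", "em", "er", "en", "es", "e", "s"):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            if cut < r1:
                return word  # suffix is not inside R1
            if suffix == "s" and (cut == 0 or word[cut - 1] not in S_ENDINGS):
                return word  # "s" needs a valid s-ending before it
            return word[:cut]
    return word


for w in ("systeme", "system", "lautem", "bestem"):
    print(w, "->", strip_step1(w))
```

The sketch never asks whether the input is a real word. It only sees that
"system", "lautem" and "bestem" all end in "em" inside R1, so all three
lose that ending.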

> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
>
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word. If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test=> select to_tsvector('german', 'linsen');
to_tsvector
-------------
'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
to_tsvector
-------------
'lin':1
(1 row)

I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.

Yours,
Laurenz Albe
