Re: snowball ASCII stemmer configuration

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: snowball ASCII stemmer configuration
Date: 2020-06-16 15:40:37
Message-ID: 1333705.1592322037@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> I am a bit surprised to see that you are right about this, because non-Latin languages often have transliteration/romanization schemes for writing the language in the Latin alphabet, developed before computers had widespread adoption of non-ASCII character sets, and still in use today for text messaging. I expected to find stemming rules for transliterated words, but can't find any indication of that, neither in the postgres sources, nor in the snowball sources I pulled from their repo. Is there some architectural separation of stemming from transliteration such that we'd never need to worry about it? If snowball ever published stemmers for transliterated text, we might have to revisit this issue, but for now your proposed change sounds fine to me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different. But they don't, AFAICS. Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not. AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if say you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.
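
To make the concern concrete, here is a rough sketch of the all-ASCII
routing rule. It is an illustration only, not PostgreSQL code: the
functions pick_dictionary and route are hypothetical, and the
dictionary names are used as plain labels. It assumes the only routing
criterion is whether every character in the word is ASCII.

```python
# Hypothetical sketch, not PostgreSQL source: choose a dictionary for a
# word based solely on whether the word is all-ASCII.
def pick_dictionary(word: str, ascii_dict: str, non_ascii_dict: str) -> str:
    # str.isascii() is True only when every character is in the ASCII range.
    return ascii_dict if word.isascii() else non_ascii_dict


def route(words, ascii_dict, non_ascii_dict):
    # Map each word to the label of the dictionary it would be sent to.
    return {w: pick_dictionary(w, ascii_dict, non_ascii_dict) for w in words}


# Russian/English: the split works, since English words are all-ASCII
# and Russian (Cyrillic) words are not.
mixed_ru_en = route(["database", "базы"], "english_stem", "russian_stem")

# Russian/French: French words with accented letters are not all-ASCII,
# so "école" lands in the Russian dictionary while "maison" does not.
mixed_ru_fr = route(["maison", "école", "базы"], "french_stem", "russian_stem")
```

Under that assumption, any language whose words mix ASCII-only and
accented spellings gets split across both dictionaries.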

regards, tom lane
