snowball ASCII stemmer configuration

From:	Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	snowball ASCII stemmer configuration
Date:	2020-06-16 08:16:21
Message-ID:	1f74d8ed-bb8b-256c-ac09-4e5101be5a50@2ndquadrant.com
Lists:	pgsql-hackers

While I was updating the snowball code, I noticed something strange. In
src/backend/snowball/Makefile:

# first column is language name and also name of dictionary for not-all-ASCII
# words, second is name of dictionary for all-ASCII words
# Note order dependency: use of some other language as ASCII dictionary
# must come after creation of that language
LANGUAGES= \
arabic arabic \
basque basque \
catalan catalan \
etc.

There are two cases where these two columns are not the same:

hindi english \
russian english \

The second one is old; the first one I added using the second one as an
example. But I wonder what the rationale for this is. Maybe for hindi
one could make some kind of cultural argument, but for russian this
seems entirely arbitrary. Perhaps using "simple" would be more sound here.
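
To make the two-column layout concrete, here is a minimal Python sketch
(not part of the PostgreSQL build) that parses an excerpt of the list and
reports the entries whose ASCII dictionary differs from the language name.
The excerpt contains only the rows quoted in this message, and the names
LANGUAGES_EXCERPT, parse_pairs, and mismatched are made up for
illustration.

```python
# Excerpt only: the rows quoted in this message, not the full Makefile list.
LANGUAGES_EXCERPT = """
arabic arabic
basque basque
catalan catalan
greek greek
hindi english
nepali nepali
russian english
tamil tamil
"""


def parse_pairs(text):
    """Return (language, ascii_dictionary) tuples, one per two-field line."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def mismatched(pairs):
    """Return the pairs whose ASCII dictionary is not the language itself."""
    return [(lang, ascii_dict) for lang, ascii_dict in pairs if lang != ascii_dict]


if __name__ == "__main__":
    for lang, ascii_dict in mismatched(parse_pairs(LANGUAGES_EXCERPT)):
        print(f"{lang}: all-ASCII words use the {ascii_dict} dictionary")
```

Run against this excerpt, only the hindi and russian rows are reported.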

Moreover, AFAIK, the following other languages do not use Latin-based
alphabets:

arabic arabic \
greek greek \
nepali nepali \
tamil tamil \

So I wonder by what rationale they use their own stemmer for the ASCII
fallback, which is probably not going to produce anything significant.
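
The fallback in question can be pictured roughly as follows. This is a
simplified Python sketch under the assumption that a word is routed purely
by whether all of its characters are ASCII; the real text search parser
classifies tokens in more detail. ASCII_DICTIONARY and pick_dictionary are
hypothetical names, not PostgreSQL identifiers.

```python
# Hypothetical mapping: language -> dictionary used for all-ASCII words,
# taken from rows quoted in this message.
ASCII_DICTIONARY = {
    "russian": "english",
    "hindi": "english",
    "greek": "greek",
}


def pick_dictionary(language, word):
    """Pick a dictionary name for a word under the simplified routing rule."""
    if word.isascii():
        # All-ASCII word: use the second column of the LANGUAGES list.
        return ASCII_DICTIONARY[language]
    # Otherwise use the language's own dictionary (first column).
    return language


print(pick_dictionary("russian", "running"))  # english
print(pick_dictionary("russian", "бегущий"))  # russian
print(pick_dictionary("greek", "running"))    # greek
```

In the last call, an English word in a Greek document would be handed to
the Greek stemmer, which is the case being questioned here.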

What's the general idea here?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
