From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: snowball ASCII stemmer configuration |
Date: | 2020-06-16 14:37:17 |
Message-ID: | 1301915.1592318237@sss.pgh.pa.us |
Lists: | pgsql-hackers |
I wrote:
> Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based
>> alphabets:
>> arabic arabic \
>> greek greek \
>> nepali nepali \
>> tamil tamil \
> Hmm. I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point. Maybe these should be changed.
After further reflection, I think these are indeed mistakes and we should
change them all. The argument for the Russian/English case, AIUI, is
"if we come across an all-ASCII word, it is most certainly not Russian,
and the most likely Latin-based language is English". Given the world
as it is, I think the same argument works for all non-Latin-alphabet
languages. Obviously specific applications might have a different idea
of the best fallback language, but that's why we let users make their
own text search configurations. For general-purpose use, falling back
to English seems reasonable. And we can be dead certain that applying
a Greek stemmer to an ASCII word will do nothing useful, so the
configuration choice shown above is unhelpful.
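For applications that want a specific fallback, a custom configuration
might look roughly like the sketch below. The name my_greek is made up
for illustration, and this assumes the installation ships the built-in
greek configuration and the english_stem dictionary. The three token
types listed are the ASCII-only ones produced by the default parser.

```sql
-- Hypothetical configuration name; start from a copy of the built-in one.
CREATE TEXT SEARCH CONFIGURATION my_greek (COPY = greek);

-- Route all-ASCII tokens to the English stemmer instead of the Greek one.
-- Non-ASCII tokens (word, hword, hword_part) keep their copied mappings.
ALTER TEXT SEARCH CONFIGURATION my_greek
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_stem;

-- Inspect how a mixed-script string is tokenized and which dictionary
-- handles each token.
SELECT alias, token, dictionaries
FROM ts_debug('my_greek', 'running δοκιμή');
```

Substituting another dictionary for english_stem in the ALTER MAPPING
step is how one would pick a different fallback language.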
regards, tom lane