Re: BUG #10589: hungarian.stop file spelling error

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "zsoros(at)gmail(dot)com" <zsoros(at)gmail(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #10589: hungarian.stop file spelling error
Date: 2014-06-11 03:24:07
Message-ID: 5397CBD7.2070606@archidevsys.co.nz
Lists: pgsql-bugs

On 11/06/14 15:09, Tom Lane wrote:
> I wrote:
>>> [ we seem to have gotten a misencoded version of hungarian.stop ]
>> Actually, it looks like things are even worse than that: the Hungarian
>> stemmer code seems to be confused about this too. In the first place,
>> we've got a LATIN1 version of that stemmer, which I would imagine is
>> entirely useless; and in the second place, the UTF8 version has no
>> reference to any non-LATIN1 characters.
>> Again, I'm suspecting this problem goes further than Hungarian,
>> because the set of stem_ISO_8859_1_foo.c files in
>> src/backend/snowball/libstemmer/ covers a lot more languages than
>> I think LATIN1 is meant to cope with. I'm not sure how much of this
>> is broken in the original Snowball code and how much is our error
>> while importing the code.
> After further analysis, it appears that:
>
> 1. The cause of the immediately complained-of problem is that we took
> the stopword file we got from the Snowball website to be in LATIN1,
> whereas it evidently was meant to be in LATIN2. The problematic
> characters were code 0xF5 in the file, which we translated to U+00F5,
> but the correct translation is U+0151. (There is another discrepancy
> between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
> none of those in the stopword file.)
>
> 2. The Snowball people were just as confused as we were about the
> appropriate encoding to use for Hungarian: their code claims that the
> Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
> character codes used in it:
>
> /* special characters (in ISO Latin I) */
>
> stringdef a' hex 'E1' //a-acute
> stringdef e' hex 'E9' //e-acute
> stringdef i' hex 'ED' //i-acute
> stringdef o' hex 'F3' //o-acute
> stringdef o" hex 'F6' //o-umlaut
> stringdef oq hex 'F5' //o-double acute
> stringdef u' hex 'FA' //u-acute
> stringdef u" hex 'FC' //u-umlaut
> stringdef uq hex 'FB' //u-double acute
>
> Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
> and u-double-acute don't appear in LATIN1 at all, and the codes shown here
> are really for LATIN2.
>
> I've reported this issue upstream and there are fixes pending.
>
> 3. While I was concerned that there might be similar bugs in the other
> Snowball stemmers, it appears after a bit of research that LATIN1 is
> commonly used as an encoding for all the other languages the Snowball
> code claims it can be used for, even though in a few cases there are
> seldom-used characters that LATIN1 can't represent. So there's not a
> clear reason to think there are any other undetected problems (and
> I would certainly not be the man to find them if they exist).
>
>
> I've gone ahead and committed the encoding fix for hungarian.stop in all
> active branches. I'm going to wait for Snowball upstream to accept the
> proposed patches before I think about incorporating the code changes.
>
> I'm not real sure whether we should consider back-patching those changes.
> Right now, the Hungarian stemmer is applying rules meant for
> o-double-acute to o-tilde, which probably means that those stemming rules
> don't fire at all on actual Hungarian text. If we fix that then the
> stemmer will behave differently, which might not be all that desirable to
> change in a minor release. Perhaps we should only make the code changes
> in HEAD and 9.4?
>
> regards, tom lane
>
>
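The discrepancy in point 1 is easy to reproduce. The following is only a small sketch, assuming Python's built-in `iso8859_1` and `iso8859_2` codecs as stand-ins for LATIN1 and LATIN2; the helper names `describe` and `repair` are made up for illustration and are not part of PostgreSQL or Snowball:

```python
import unicodedata


def describe(byte_value, codec):
    """Decode a single byte with the given codec; return (code point, Unicode name)."""
    ch = bytes([byte_value]).decode(codec)
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


def repair(text):
    """Undo a LATIN1 mis-decoding of text whose bytes were really LATIN2.

    Assumption: every character in `text` came from a single LATIN1 byte.
    """
    return text.encode("iso8859_1").decode("iso8859_2")


# The two bytes where LATIN1 and LATIN2 disagree for Hungarian.
for byte_value in (0xF5, 0xFB):
    for codec in ("iso8859_1", "iso8859_2"):
        print(f"0x{byte_value:02X}", codec, *describe(byte_value, codec))
```

Read as LATIN1, 0xF5 comes out as o-tilde (U+00F5); read as LATIN2, it is o-double-acute (U+0151). Bytes that LATIN1 and LATIN2 share, such as 0xE1 for a-acute, pass through `repair` unchanged.
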
Not saying there is any problem, but you might like to check how the euro
currency symbol is handled (it is in neither LATIN1 nor LATIN2; of the ISO 8859
Latin sets, LATIN9 and LATIN10 include it):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € euro sign
(HTML: &#8364; or &euro;)
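One quick way to check which encodings can represent the euro sign at all is sketched below. It assumes Python's standard codecs follow the corresponding ISO 8859 and Windows code page tables; `can_encode` is a made-up helper name, and the list of codecs is just an example:

```python
EURO = "\u20ac"  # U+20AC EURO SIGN


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# LATIN1, LATIN2, LATIN9, Windows-1252 and UTF-8, by their Python codec names.
for codec in ("iso8859_1", "iso8859_2", "iso8859_15", "cp1252", "utf_8"):
    print(codec, can_encode(EURO, codec))
```

With Python's codecs, `iso8859_1` and `iso8859_2` both reject the character, while `iso8859_15`, `cp1252` and `utf_8` accept it, so any test of euro handling has to use one of the latter encodings.
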

Cheers,
Gavin
