From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Hugh Ranalli <hugh(at)whtc(dot)ca> |
Cc: | Daniel Verite <daniel(at)manitou-mail(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date: | 2018-12-15 18:44:48 |
Message-ID: | 23237.1544899488@sss.pgh.pa.us |
Lists: | pgsql-bugs pgsql-hackers |
Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Me too -- seems like that bears looking into. Perhaps the script's
>> results are platform dependent -- what were you testing on?
> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories (
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.
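The category-based selection described in the quoted text can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python 3's standard `unicodedata` module in place of parsing UnicodeData.txt, it omits the range check and the Latin-ASCII transliteration step, and `is_combining_mark` and `plain_equivalent` are hypothetical names, not functions from generate_unaccent_rules.py.

```python
import unicodedata


def is_combining_mark(ch):
    # General categories Mn, Mc and Me all start with "M".
    return unicodedata.category(ch).startswith("M")


def plain_equivalent(ch):
    """Return the replacement a rule might map ch to, or None if no rule applies.

    Hypothetical helper for illustration; expects a single character.
    """
    if is_combining_mark(ch):
        return ""  # a bare combining mark is simply dropped
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if marks and unicodedata.category(base).startswith("L") and all(
        is_combining_mark(m) for m in marks
    ):
        return base  # letter plus marks: keep only the letter
    return None  # left for other sources, e.g. the transliteration file
```

With this sketch, `plain_equivalent("\u00e9")` gives `"e"` and `plain_equivalent("\u0301")` gives the empty string, while the copyright sign yields `None` because it has no letter-plus-mark decomposition.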
Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

    python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.
I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
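One way to make a comparison like this repeatable across two machines is to exchange a digest of each file along with the interpreter version. The sketch below is an assumption about how one might do that, not part of the unaccent tooling; `file_digest`, `same_bytes`, and `environment_summary` are hypothetical helper names.

```python
import hashlib
import sys


def file_digest(path):
    # Hash the raw bytes so line endings and encodings are compared exactly.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    # True only when both files have identical contents.
    return file_digest(path_a) == file_digest(path_b)


def environment_summary():
    # Interpreter version and platform, the details being compared in this thread.
    major, minor, micro = sys.version_info[:3]
    return "Python %d.%d.%d on %s" % (major, minor, micro, sys.platform)
```

Calling `same_bytes("newrules", "unaccent.rules")` on each side, and posting `environment_summary()` with the result, would show whether the generated output differs and under which interpreter.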
regards, tom lane