From: | Hugh Ranalli <hugh(at)whtc(dot)ca> |
---|---|
To: | thomas(dot)munro(at)enterprisedb(dot)com |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date: | 2018-12-20 22:39:36 |
Message-ID: | CAAhbUMNyZ+PhNr_mQ=G161K0-hvbq13Tz2is9M3WK+yX9cQOCw@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
Okay, I've tried to separate everything cleanly. The patches are numbered
in the order in which they should be applied. Each patch contains all the
updates appropriate to that version (i.e., if the change would modify
unaccent.rules, those changes are also in the patch):
01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
The approach I have taken is "native" Python 3 compatibility with
adjustments for Python 2. There's a marked block at the beginning of the
file that can be removed whenever Python 2 support is dropped. I haven't
followed the recommended practice of importing the "past" or "future"
modules, as the changes are minimal, and these are just additional
dependencies that need to be installed separately, which didn't seem to
make sense for a utility script. This patch also updates sql/unaccent.sql
to UTF-8 format.
02 - Updates generate_unaccent_rules.py to work with all versions (I tested
r28 and r34) of the Latin-ASCII transliteration file. It also updates
unaccent.rules to have the output of the r34 transliteration file. This
patch should work without the 01 patch.
03 - Updates generate_unaccent_rules.py to remove combining diacritical
marks. It also updates unaccent.rules with the revised output, and adds
tests to sql/unaccent.sql. It will neither apply nor work unless the 01 patch
is applied first. It should work without the 02 patch.
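For readers without the attachments to hand, here is a rough sketch of the two
ideas in patches 01 and 03: a marked Python 2 block on top of otherwise native
Python 3 code, and detection of combining diacritical marks. It is not the
code from the patches. The helper names (is_combining_mark, combining_rules,
strip_combining) and the choice of the U+0300..U+036F range are assumptions
made for illustration only.

```python
# Illustrative sketch only; NOT the code from the attached patches.
import sys
import unicodedata

# --- BEGIN Python 2 compatibility (assumed layout; delete when Python 2 is dropped) ---
if sys.version_info[0] <= 2:
    chr = unichr  # noqa: F821 -- on Python 2, make chr() return unicode text
# --- END Python 2 compatibility ---

# Assumption: "combining diacritical marks" means the Unicode block
# U+0300..U+036F. The real patch may select characters differently.
COMBINING_FIRST = 0x0300
COMBINING_LAST = 0x036F


def is_combining_mark(ch):
    """Return True if ch is a nonspacing mark (category Mn) in the assumed block."""
    return (COMBINING_FIRST <= ord(ch) <= COMBINING_LAST
            and unicodedata.category(ch) == "Mn")


def combining_rules():
    """Yield each combining mark on its own, i.e. a rule with no replacement."""
    for codepoint in range(COMBINING_FIRST, COMBINING_LAST + 1):
        ch = chr(codepoint)
        if is_combining_mark(ch):
            yield ch


def strip_combining(text):
    """Remove combining marks from text, leaving base characters untouched."""
    return u"".join(ch for ch in text if not is_combining_mark(ch))
```

For example, strip_combining(u"cafe\u0301") drops the combining acute accent
and keeps the plain "e". A rules line holding a single character tells
unaccent to delete that character, which is why each mark is yielded alone.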
When you look at unaccent.rules generated by the 03 version, there may
appear to be blank lines. I've checked and they're not blank. They contain
combining characters, which are only visible when rendered after another
character, at least in my editor.
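One way to confirm that such a line is not really empty is to print the code
points it holds. The sketch below is a hypothetical check, not part of the
patches; describe_line and looks_blank_but_is_not are names invented here.

```python
# Illustrative check only; the function names are hypothetical.
import unicodedata


def describe_line(line):
    """Return (code point, Unicode name) pairs for each character before the newline."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in line.rstrip("\r\n")]


def looks_blank_but_is_not(line):
    """True if the line is non-empty yet holds only nonspacing marks (category Mn)."""
    body = line.rstrip("\r\n")
    return len(body) > 0 and all(unicodedata.category(ch) == "Mn" for ch in body)
```

Applied to a line that holds only U+0301, describe_line reports COMBINING
ACUTE ACCENT, and looks_blank_but_is_not returns True.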
I'll go update the CommitFest now. I hope I've covered everything; please
let me know if there's anything I've missed.
Best wishes,
Hugh
Attachment | Content-Type | Size |
---|---|---|
01-generate-unaccent-rules-python2-and-3-01.patch | text/x-patch | 4.2 KB |
02-generate_unaccent_rules-handle-all-Latin-ASCII-versions-01.patch | text/x-patch | 1.7 KB |
03-generate_unaccent_rules-remove-combining-diacritical-accents-01.patch | text/x-patch | 3.9 KB |