From: | Michael Gradek <mike(at)busbud(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #13440: unaccent does not remove all diacritics |
Date: | 2015-06-16 01:58:04 |
Message-ID: | CAEP8ZNVKxwBNyQx-CxcTL0hiNax3AScy208fs=8_Qp2cHt8y1A@mail.gmail.com |
Lists: | pgsql-bugs |
Thanks, everyone. I've been comparing unaccent's behavior to that of
https://github.com/andrewrk/node-diacritics/blob/master/index.js
in case that is of any help.
On Monday, June 15, 2015, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:
> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> >> My terminal shows these characters to be different. One is
> >> http://graphemica.com/%C8%9B
> >> latin small letter t with comma below (U+021B)
> >
> >> The other is
> >> http://graphemica.com/%C5%A3
> >> latin small letter t with cedilla (U+0163)
> >
> > Ah-hah -- I did not look closely enough. So the immediate answer for
> > Michael is to add another entry to his unaccent.rules file.
> >
> > Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma. Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
> 702
>
> That's quite a lot more than the 187 we currently have. Of those, I
> think only the following ligature characters don't fit the above
> pattern: Æ, æ, Ĳ, ĳ, Œ, œ, ß. Incidentally, I don't believe that the
> way we "unaccent" ligatures is correct anyway. Maybe they should be
> expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
> O, o, S as we have it, but I guess it depends what the purpose of
> unaccent is...
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
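For what it's worth, here is a rough sketch of how the rules suggested above could be generated, in case it helps the discussion. It is only an illustration under a few assumptions: it uses Python's bundled Unicode database rather than the 7.0.0 UnicodeData.txt file, so the number of matches may differ from the 702 quoted above; the base letter is taken from the character name, which is crude for digraph-style names; and the function names (accented_latin_letters, build_rules) are made up for this example. The ligature expansions are the ones proposed in the quoted message, not what unaccent.rules does today.

```python
import re
import sys
import unicodedata

# Same name pattern as the egrep in the quoted message, applied to
# character names instead of UnicodeData.txt lines.
PATTERN = re.compile(r"^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")

# Ligature expansions as proposed in the quoted message (an assumption
# about the desired behavior, not the current unaccent.rules content).
LIGATURES = {
    "Æ": "AE", "æ": "ae",
    "Ĳ": "IJ", "ĳ": "ij",
    "Œ": "OE", "œ": "oe",
    "ß": "ss",
}


def accented_latin_letters():
    """Yield (character, base letter) pairs for names matching PATTERN."""
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        # Unnamed code points (including surrogates) get "" and never match.
        match = PATTERN.match(unicodedata.name(char, ""))
        if match:
            kind, letter = match.groups()
            yield char, letter.lower() if kind == "SMALL" else letter


def build_rules():
    """Return a dict mapping each source character to its replacement."""
    rules = dict(accented_latin_letters())
    rules.update(LIGATURES)
    return rules


if __name__ == "__main__":
    # One rule per line: source character, whitespace, replacement.
    for source, target in sorted(build_rules().items()):
        print(f"{source}\t{target}")
```

Both t with comma below (U+021B) and t with cedilla (U+0163) come out mapped to t this way, and the Romanian s with comma is covered as well. The output would still need a review by hand before anything like it went into the standard rules file.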
--
Cheers,
Mike
--
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike(at)busbud(dot)com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*