Re: BUG #13440: unaccent does not remove all diacritics

From: Michael Gradek <mike(at)busbud(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-16 01:58:04
Message-ID: CAEP8ZNVKxwBNyQx-CxcTL0hiNax3AScy208fs=8_Qp2cHt8y1A@mail.gmail.com
Lists: pgsql-bugs

Thanks, everyone. I've been comparing the behavior to that of
https://github.com/andrewrk/node-diacritics/blob/master/index.js in case
that is of any help.

On Monday, June 15, 2015, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> wrote:
> > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> >> My terminal shows these characters to be different. One is
> >> http://graphemica.com/%C8%9B
> >> latin small letter t with comma below (U+021B)
> >
> >> The other is
> >> http://graphemica.com/%C5%A3
> >> latin small letter t with cedilla (U+0163)
> >
> > Ah-hah -- I did not look closely enough. So the immediate answer for
> > Michael is to add another entry to his unaccent.rules file.
> >
> > Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma. Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
> 702
>
> That's quite a lot more than the 187 we currently have. Of those, I
> think only the following ligature characters don't fit the above
> pattern: Æ, æ, Ĳ, ĳ, Œ, œ, ß. Incidentally, I don't believe that the
> way we "unaccent" ligatures is correct anyway. Maybe they should be
> expanded to AE, ae, IJ, ij, OE, oe, ss, respectively, not A, a, I, i,
> O, o, S as we have it, but I guess it depends what the purpose of
> unaccent is...
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
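
A minimal sketch of how the rule entries discussed in the quoted message
could be generated, assuming the two-column "source, whitespace, target"
layout of unaccent.rules. It applies the same name pattern as the egrep
command to Python's bundled Unicode database, which is not necessarily
version 7.0.0, so the number of matches may differ from 702. The names
NAME_RE, LIGATURES, base_letter and rules are hypothetical, and the
ligature expansions are the ones proposed in the thread, not the current
behavior of unaccent. The pattern also matches a few digraph letters, so
the output would need review before use.

```python
import re
import sys
import unicodedata

# Same pattern as the egrep command, applied to Unicode character names.
NAME_RE = re.compile(r"^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")

# Assumed expansions for the ligatures named in the thread (hypothetical table).
LIGATURES = {
    "\u00c6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u0132": "IJ",  # LATIN CAPITAL LIGATURE IJ
    "\u0133": "ij",  # LATIN SMALL LIGATURE IJ
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
}


def base_letter(ch):
    """Return the plain letter for 'LATIN ... LETTER X WITH ...', else None."""
    match = NAME_RE.match(unicodedata.name(ch, ""))
    if match is None:
        return None
    letter = match.group(2)
    return letter.lower() if match.group(1) == "SMALL" else letter


def rules():
    """Yield (source, target) pairs in code point order, ligatures last."""
    for code_point in range(0x80, sys.maxunicode + 1):
        ch = chr(code_point)
        target = base_letter(ch)
        if target is not None:
            yield ch, target
    yield from LIGATURES.items()


if __name__ == "__main__":
    # One rule per line: accented character, a tab, then its replacement.
    for source, target in rules():
        print(f"{source}\t{target}")
```

With this approach both t with comma below (U+021B) and t with cedilla
(U+0163) map to "t", and the Romanian s with comma is covered as well.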

--
Cheers,
Mike
--
Mike Gradek
Co-founder and CTO, Busbud
Busbud.com <http://busbud.com/> | mike(at)busbud(dot)com
*We're hiring!: Jobs at Busbud <http://www.busbud.com/en/about/jobs>*
