Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Francisco Olarte <folarte(at)peoplecall(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, shailesh(dot)totale(at)sailpoint(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'
Date: 2023-11-29 08:45:09
Message-ID: CAFj8pRALjAQmCjQ+NiCPpob+dAprBFPb2XqZPeYDHEjdJmYK9A@mail.gmail.com
Lists: pgsql-bugs

Hi

On Wed, 29 Nov 2023 at 9:13, Francisco Olarte <folarte(at)peoplecall(dot)com>
wrote:

> Hi Jeff:
>
> On Wed, 29 Nov 2023 at 03:40, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> I am not going to generally discuss this:
> > But isn't it generally the case that removing accents might make you
> > land on a different word with a different meaning?
>
> But this one is a bad example,
> > 'ano' and 'año' for example mean different things in Spanish (but
> > unaccent removes it anyway, at least in one out of four attempts to get the
> > non-7-bit-ASCII wedged through my terminal and into the function).
>
> N and Ñ are different letters in Spanish. It looks like an accent, can
> be typed as such, and some unaccent rules in some programs may make
> them equal, but Ñ is as different from N as it is from Z ( I am Spanish,
> and in case you want some authority link see
> https://www.rae.es/dpd/%C3%B1 ). It has its own pages in the dictionary
> ( even on paper, I just checked in case my memory fails ).
>
> We used to have also CH and LL as letters, but they were dropped
> "recently" ( that meaning this century, I'm getting old ).
>
> On the other "accents", à, è, ì, ò, ù can generally be unaccented w/o
> problem; although they may change meaning in some corner cases, I do
> not remember seeing them do that since the special examples in school.
> Another thing is ü, which is used in our "special" handling of hard/soft
> vowels after g, i.e., you do not pronounce the u in "reguero" ( but it
> modifies how you pronounce the g, differently from agente ), but in
> "agüero" you do pronounce it.
>
> But Ñ is a proper letter, you cannot break it. Our alphabet goes
> m-n-ñ-o-p-q.
>

Some users use unaccent for transformation to 7-bit ASCII.

In the Czech language I can find more examples where removing diacritics
means a significant loss, and the meaning of the word can then be
determined only from context.

Žár (the heat) -> zar
Zář (the shine) -> zar
Být (to be) -> byt
Byt (the flat) -> byt

And with unaccent we expect this loss.
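
To show the collision outside the database, here is a minimal Python sketch.
It is not how the unaccent extension works (unaccent uses a rules file); as a
stand-in it assumes Unicode NFD decomposition followed by dropping combining
marks. The helper name strip_diacritics is made up for this illustration, and
the words are lowercased so that only the diacritics differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Approximate accent removal (an assumption, not unaccent's algorithm):
    decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Word pairs from the examples above.
pairs = [("žár", "zář"), ("být", "byt")]

for first, second in pairs:
    a, b = strip_diacritics(first), strip_diacritics(second)
    # Report whether two different words collapse to the same ASCII string.
    print(f"{first} -> {a}, {second} -> {b}, "
          f"collide={a == b}, ascii={a.isascii() and b.isascii()}")
```

Both pairs end up as the same ASCII string, so after the transformation the
original word can be recovered only from context.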

So my question is: can the unaccent function be used for transformation to
7-bit ASCII, or is that a wrong usage?

Regards

Pavel

>
> Francisco Olarte.
>
> P.S. To really sound Spanish, we would have picked "cono" for the
> examples :-p
>
> FO
>
>
>
