Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, shailesh(dot)totale(at)sailpoint(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'
Date: 2023-11-29 02:40:27
Message-ID: CAMkU=1xvF9NMPJgXTULGYw-5KqH5xduEPDqOT7gvbH2SRWJK-A@mail.gmail.com
Lists: pgsql-bugs

On Tue, Nov 28, 2023 at 8:06 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Tue, Nov 28, 2023 at 09:58:35AM -0500, Tom Lane wrote:
> > PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> >> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
> >> simple search-and-replace dictionary. The dictionary, unaccent.rules
> >> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules),
> >> does not contain these Japanese characters, thus its unable to remove
> >> the diacritic signs. Can someone please guide when we can expect these
> >> Japanese characters will be added.
> >
> > unaccent.rules, as distributed, is just an example. It is not meant
> > to be exhaustive or authoritative.
>
> FWIW, I'm quite fluent in Japanese and was discussing a bit this
> around me and, like me, folks were kind of troubled with the concept
> that these should be considered as "accents", because it would
> entirely change the meaning of what each Hiragana and Katakana means.
>

But isn't it generally the case that removing accents might make you land
on a different word with a different meaning?

'ano' and 'año', for example, mean different things in Spanish (but unaccent
removes the tilde anyway, at least in one out of four attempts to get the
non-7-bit-ASCII wedged through my terminal and into the function).

That doesn't mean that unaccent is required to do it, of course. But
the possibility of changing the meaning doesn't seem like a reason not to
do it.
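
To make the comparison concrete, here is a minimal sketch of the
normalisation-based approach that the original report contrasts with
unaccent's rules file. It is not how unaccent works (per the report, unaccent
is a search-and-replace dictionary); it only shows that, under Unicode
canonical decomposition, the dakuten on 'ド' and the tilde on 'ñ' are both
combining marks. The function name strip_combining_marks is made up for
illustration.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose text to NFD and drop every combining mark.

    Hypothetical helper for illustration only; this is NOT the
    algorithm used by PostgreSQL's unaccent module.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero canonical combining class
    # for combining marks (such as U+0303 COMBINING TILDE and U+3099
    # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) and 0 for base
    # characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("año", "ド"):
        print(word, "->", strip_combining_marks(word))
```

Both inputs lose their mark the same way: 'año' becomes 'ano' and 'ド' becomes
'ト', each of which means something different from what you started with.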

Cheers,

Jeff
