Re: BUG #13440: unaccent does not remove all diacritics

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-19 09:51:25
Message-ID: CAE2gYzxRa6wWWL1NS2e8+sjzdNKRu5tMs-AGMdo2wcmq6RfTDg@mail.gmail.com
Lists: pgsql-bugs

> To me, conceptually what unaccent does is turn whatever junk you have
> into a very basic common alphabet (ascii); then it's very easy to do
> full text searches without having to worry about what accents the people
> did or did not use in their searches. If we say "okay, but that funny
> char is not an accent so let's leave it alone" then the charter doesn't
> sound so useful to me.

It is the same for me. It is unfortunate that this module is named
"unaccent". There are many characters in the rules file that have
nothing to do with accents. They are normal letters in some alphabets
that are not in ASCII. "replace-with-ascii" would be a better name
for it.

> The cases I care about are okay anyway, because all the funny chars in
> spanish are already covered; and maybe German people always enter their
> queries using the funny ss thing I can't even write, and then this is
> not a problem for them.

I have been learning German for only a few months, and even I can confirm
that replacing "ß" with "s", or "ü" with "u" is wrong. On the other
hand, if they were correctly replaced with "ss" and "ue", I would
be really unhappy, because it is just too common in Turkish to type
"u" instead of "ü".

I think it is better for this module to replace those characters with
a single ASCII character that sounds similar. From this point of
view, I think it is fine to replace "ß" with "s" even if it is obviously
wrong. This module will never be useful for German without breaking
other usages, anyway. We can try to cover as many characters as
possible keeping this in mind.

It would also be nice to support other rule sets, both for real
"unaccent" and for correct German replacement. Maybe we can add
different rule files to this module.
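
As a rough illustration of the "different rule files" idea, here is a
minimal Python sketch. It is not the module's implementation. The table
contents and the names ASCII_FOLD_RULES, GERMAN_RULES, RULE_SETS and
fold are hypothetical, chosen only to show how one input could be folded
differently depending on the rule set selected.

```python
# Hypothetical rule tables; NOT the contents of the real unaccent.rules file.
# "ascii": one similar-sounding ASCII character per input character.
ASCII_FOLD_RULES = {"ß": "s", "ü": "u", "ö": "o", "ä": "a", "ı": "i"}
# "german": the conventional German transliterations.
GERMAN_RULES = {"ß": "ss", "ü": "ue", "ö": "oe", "ä": "ae"}

RULE_SETS = {"ascii": ASCII_FOLD_RULES, "german": GERMAN_RULES}


def fold(text, rules="ascii"):
    """Replace each character found in the chosen rule set; keep the rest."""
    table = RULE_SETS[rules]
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word in ("Straße", "über", "üzüm"):
        print(word, fold(word, "ascii"), fold(word, "german"))
```

With the single-character table, a Turkish user who types "uzum" still
matches "üzüm", while the German table would turn it into "uezuem",
which is the conflict described above.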

> Regarding back-patching unaccent.rules changes as discussed downthread,
> I think it's okay to simply document that any indexes using the module
> should be reindexed immediately after upgrading to that minor version.
> The consequence of not doing so is not *that* serious anyway. But then,
> since I'm not actually affected in any way, I'm not strongly holding
> this position either.

I think it would cause more trouble than help if we ever back-patch
changes to these rules.
