Re: BUG #18362: unaccent rules and Old Greek text

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Cees van Zeeland <cees(dot)van(dot)zeeland(at)freedom(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18362: unaccent rules and Old Greek text
Date: 2024-05-15 07:01:10
Message-ID: 1bcd13b7-6e00-4de1-961e-b7669f05a2da@eisentraut.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 14.05.24 16:51, Robert Haas wrote:
> 2. The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, boldfaced
> and/or italicized and/or otherwise-fancily-printed characters into
> full-sized characters. Only a handful of the changes are actually
> adding rules that specifically*remove an accent*, but there are
> similar rules that already exist, like turning ⅐ into the
> four-character sequence " 1/7" and blocky-looking versions of each
> letter into standard versions and ㍱ into the three-character sequence
> "hPa". So my naive guess would be that we want all of these rules,
> even though you would not guess from the unaccent documentation that
> it's supposed to do stuff like this.

unaccent actually does both accent removal and ligature expansion.
(This is documented.) The cases you show above are ligature expansions.

You can also run generate_unaccent_rules.py with --no-ligatures and then
you get a smaller list that indeed looks more like just accent removal.

It does look like that whatever it thinks a ligature is has some
unintuitive results.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2024-05-15 07:03:12 BUG #18467: postgres_fdw (deparser) ignores LimitOption
Previous Message Peter Eisentraut 2024-05-15 06:45:27 Re: BUG #18362: unaccent rules and Old Greek text