Re: BUG #18362: unaccent rules and Old Greek text

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Cees van Zeeland <cees(dot)van(dot)zeeland(at)freedom(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18362: unaccent rules and Old Greek text
Date: 2024-05-18 09:36:25
Message-ID: CA+hUKGJmgaxpNn5x1Po1kmUxDiojsYWVWKKvhX+4QnyjDCWKKQ@mail.gmail.com
Lists: pgsql-bugs

On Thu, May 16, 2024 at 1:40 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> > On 14.05.24 16:51, Robert Haas wrote:
> > The rules are only loaded once on first use, right? I tested with
> >
> > date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date
> >
> > and this had the same runtime (about 8 seconds here) with and without
> > the patch.
>
> Cool. Sounds like that's not a problem.

Thanks Peter for testing, and thanks Robert for kicking this thread.

> > Btw., with the patch I get
> >
> > WARNING: duplicate source strings, first one will be used
> >
> > so it will need adjustments in how the rules are produced.
>
> OK. Does anyone want to look into that?

I think the problem is that the new "simple redirection" rule from the
Unicode database produces some values that are also present in
Latin-ASCII.xml, and these are all tolerated as long as the "from" and
"to" strings both match, because we uniquify them as pairs. But there
is one pair where the "to" string is different, resulting in this
clash:

ℌ x
ℌ H
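
To illustrate why identical pairs are harmless but this one is not, here is
a minimal sketch. It is not the actual rule generator code; the function
name find_clashes and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def find_clashes(pairs):
    """Uniquify (source, target) pairs, then return the sources that
    still map to more than one distinct target."""
    unique_pairs = set(pairs)          # identical pairs collapse here
    targets = defaultdict(set)
    for src, dst in unique_pairs:
        targets[src].add(dst)
    return {src: sorted(dsts) for src, dsts in targets.items() if len(dsts) > 1}


# Illustrative input only: U+210C appears with two different targets,
# while U+212C appears twice with the same target.
pairs = [
    ("\u210c", "x"),
    ("\u210c", "H"),
    ("\u212c", "B"),
    ("\u212c", "B"),
]

clashes = find_clashes(pairs)
print(clashes)
```

Only U+210C is reported, since the duplicated U+212C pair collapses into one
entry when the pairs are uniquified.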

I think the first line might actually be a bug in CLDR data. I dunno,
but this just doesn't look right:

ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)

And in the tests I now see that Michael had already figured that out!
I've included a kludge to remove that. Someone should file a ticket with CLDR.
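
Roughly, a kludge of that kind amounts to a skip list applied before the
rules are written out. The following is only a sketch of the idea, not the
code in the attached patch; KNOWN_BAD_RULES and drop_known_bad are
hypothetical names.

```python
# Hypothetical skip list of (source, target) pairs believed to be wrong
# upstream. U+210C is BLACK-LETTER CAPITAL H, which should not become "x".
KNOWN_BAD_RULES = {("\u210c", "x")}


def drop_known_bad(rules):
    """Return the rules with any known-bad pairs filtered out, keeping order."""
    return [rule for rule in rules if rule not in KNOWN_BAD_RULES]


rules = [("\u210c", "x"), ("\u210c", "H")]
cleaned = drop_known_bad(rules)
print(cleaned)
```

With the bad pair dropped, only one rule remains for that source string, so
the duplicate warning has nothing left to complain about in this example.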

Attachment: v2-0001-Add-simple-codepoint-redirections-to-unaccent.rul.patch (application/x-patch, 13.4 KB)
