From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, Cees van Zeeland <cees(dot)van(dot)zeeland(at)freedom(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18362: unaccent rules and Old Greek text |
Date: | 2024-05-18 09:36:25 |
Message-ID: | CA+hUKGJmgaxpNn5x1Po1kmUxDiojsYWVWKKvhX+4QnyjDCWKKQ@mail.gmail.com |
Lists: | pgsql-bugs |
On Thu, May 16, 2024 at 1:40 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> > On 14.05.24 16:51, Robert Haas wrote:
> > The rules are only loaded once on first use, right? I tested with
> >
> > date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')"
> > -o /dev/null; done; date
> >
> > and this had the same runtime (about 8 seconds here) with and without
> > the patch.
>
> Cool. Sounds like that's not a problem.
Thanks Peter for testing, and thanks Robert for kicking this thread.
> > Btw., with the patch I get
> >
> > WARNING: duplicate source strings, first one will be used
> >
> > so it will need some adjustments in how the rules are produced.
>
> OK. Does anyone want to look into that?
I think the problem is that the new "simple redirection" rule from the
Unicode database produces some values that are also present in
Latin-ASCII.xml, and these are all tolerated as long as the "from" and
"to" strings both match, because we uniquify them as pairs. But there
is one pair where the "to" string is different, resulting in this
clash:
ℌ x
ℌ H
I think the first line might actually be a bug in CLDR data. I dunno,
but this just doesn't look right:
ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)
And in the tests I now see that Michael had already figured that out!
I've included a kludge to remove that. Someone should file a ticket with CLDR.
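To illustrate the clash described above, here is a minimal sketch in Python. It is not the code from the attached patch: the function name `find_conflicts` and the sample rule list are hypothetical, and the "é" entries are there only to show an exact duplicate pair. Rules are uniquified as ("from", "to") pairs, so exact duplicates collapse harmlessly, while a "from" string that maps to two different "to" strings is reported.

```python
from collections import defaultdict


def find_conflicts(rules):
    """Return {source: sorted targets} for sources with more than one target.

    `rules` is an iterable of (from, to) string pairs. Exact duplicate
    pairs are collapsed first, mirroring "uniquify them as pairs".
    """
    targets = defaultdict(set)
    for src, dst in set(rules):  # uniquify as (from, to) pairs
        targets[src].add(dst)
    return {src: sorted(dsts) for src, dsts in targets.items() if len(dsts) > 1}


# Hypothetical sample input, for illustration only.
rules = [
    ("\u210c", "x"),  # the suspicious rule for U+210C seen in the CLDR data
    ("\u210c", "H"),  # the competing rule for the same source string
    ("\u00e9", "e"),
    ("\u00e9", "e"),  # exact duplicate pair: tolerated, collapses to one
]

conflicts = find_conflicts(rules)
print(conflicts)
```

With this input only U+210C is reported, because it is the only source string whose "to" values differ. A check along these lines could run before the rules file is written, so the clash shows up at generation time rather than as a WARNING when the rules are loaded.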
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Add-simple-codepoint-redirections-to-unaccent.rul.patch | application/x-patch | 13.4 KB |