Re: encoding affects ICU regex character classification

From:	Jeremy Schneider <schneider(at)ardentperf(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: encoding affects ICU regex character classification
Date:	2023-12-12 22:35:57
Message-ID:	1b6fa4be-fdac-481e-82f5-1ffdfbbdb0fd@ardentperf.com
Lists:	pgsql-hackers

On 12/12/23 1:39 PM, Jeff Davis wrote:
> On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
>> Unless you also
>> implement built-in case mapping, you'd still have to call libc or ICU
>> for that, right?
>
> We can do built-in case mapping, see:
>
> https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
>
>> It seems a bit strange to use different systems for
>> classification and mapping. If you do implement mapping too, you have
>> to decide if you believe it is language-dependent or not, I think?
>
> A complete solution would need to do the language-dependent case
> mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
> and only a handful of mapping changes, so we can handle that with the
> builtin provider as well.
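
To make the "handful of mapping changes" concrete, here is a minimal sketch of language-tailored case mapping for the Turkish and Azerbaijani dotted/dotless i. The function names are hypothetical and this is not the proposed builtin provider code; it only covers the simple "tr"/"az" single-character cases and leaves out the Lithuanian ("lt") rules and the combining dot above handling.

```python
# Hypothetical sketch: tailor the default (language-independent) case mapping
# for "tr" and "az", where dotted i and dotless i are distinct letters.
# Lithuanian ("lt") tailoring is intentionally not covered here.

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I
TURKIC = ("tr", "az")


def upper_tailored(text: str, lang: str) -> str:
    """Uppercase text, mapping i to dotted capital I for tr/az."""
    if lang in TURKIC:
        text = text.replace("i", DOTTED_CAPITAL_I)
    # The default mapping already sends dotless i to plain I.
    return text.upper()


def lower_tailored(text: str, lang: str) -> str:
    """Lowercase text, mapping I to dotless i and dotted capital I to i for tr/az."""
    if lang in TURKIC:
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


if __name__ == "__main__":
    for lang in ("en", "tr"):
        print(lang, upper_tailored("iı", lang), lower_tailored("Iİ", lang))
```

Everything outside those few characters falls through to the default mapping, which is why the tailoring can stay small.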

This thread has me second-guessing the reply I just sent on the other
thread.

Is someone able to test the upper() and lower() functions on U+A7BA ...
U+A7BF across a few libs/versions? Theoretically the upper/lower behavior
should change in ICU between Ubuntu 18.04 LTS and Ubuntu 20.04 LTS
(specifically in ICU 64 / Unicode 12). And I have no idea if or when
glibc might have picked up the new Unicode characters.
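
As a baseline for comparison, here is a rough sketch that prints the case mappings for that range using Python's bundled Unicode tables. It does not exercise libc or ICU; it only reports what one particular Unicode data version says, so the same characters would still need to be run through upper() and lower() in PostgreSQL under each provider. The function name case_report is just something I made up for this.

```python
import unicodedata


def case_report(first=0xA7BA, last=0xA7BF):
    """Return (code point, name, upper, lower) rows per this interpreter's Unicode tables."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Characters unknown to this Unicode version have no name.
        name = unicodedata.name(ch, "<unassigned>")
        upper = " ".join(f"U+{ord(c):04X}" for c in ch.upper())
        lower = " ".join(f"U+{ord(c):04X}" for c in ch.lower())
        rows.append((f"U+{cp:04X}", name, upper, lower))
    return rows


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for row in case_report():
        print(*row, sep="\t")
```

If a library predates Unicode 12, I would expect each of these code points to map to itself in both directions, which is the difference to look for.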

-Jeremy

--
http://about.me/jeremy_schneider

In response to

Re: encoding affects ICU regex character classification at 2023-12-12 21:39:55 from Jeff Davis

Responses

Re: encoding affects ICU regex character classification at 2023-12-14 15:12:27 from Jeff Davis
