Re: encoding affects ICU regex character classification

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: encoding affects ICU regex character classification
Date:	2023-12-12 21:39:55
Message-ID:	03959b5f8b37b6126d0b9c6ac16c960a94fcd3bb.camel@j-davis.com
Lists:	pgsql-hackers

On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:

>
> How would you specify what you want?

One proposal would be to have a builtin collation provider:

https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as some provider-
specific options specified at CREATE COLLATION time.

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

> It seems a bit strange to use different systems for
> classification and mapping. If you do implement mapping too, you
> have to decide if you believe it is language-dependent or not, I
> think?

A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.
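
A minimal Python sketch of how such per-language overrides might sit in
front of a default one-to-one mapping. The function names and the TURKIC
set are hypothetical, only the dotted/dotless i rule shared by "tr" and
"az" is shown (the Lithuanian dot-retention rules are left out), and
Python's str.upper()/str.lower() stand in for whatever tables a builtin
provider would carry:

```python
# Languages that distinguish dotted and dotless i (assumption for this sketch).
TURKIC = {"tr", "az"}

def lower_char(ch, lang=""):
    """Lowercase one character, applying Turkic overrides first."""
    if lang in TURKIC:
        if ch == "I":
            return "\u0131"  # LATIN SMALL LETTER DOTLESS I
        if ch == "\u0130":   # LATIN CAPITAL LETTER I WITH DOT ABOVE
            return "i"
    low = ch.lower()
    # Leave the character unchanged when the full mapping is not a single
    # code point (a simplification made for this sketch).
    return low if len(low) == 1 else ch

def upper_char(ch, lang=""):
    """Uppercase one character, applying Turkic overrides first."""
    if lang in TURKIC and ch == "i":
        return "\u0130"
    up = ch.upper()
    return up if len(up) == 1 else ch
```

With no language given, upper_char("i") is the ordinary "I"; with "tr" or
"az" it becomes "İ", and lower_char("I", "tr") becomes "ı".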

> Hmm, let's see what we're doing now... for ICU the regex code is
> using "simple" case mapping functions like u_toupper(c) that don't
> take a locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to
> the comments at the top. I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.
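
To illustrate the point, here is a small Python sketch of a literal
case-insensitive comparison that, like a character-by-character regex
engine, expands each pattern character into a set of single-character
alternatives. The helper names are made up for this example and Python's
str.upper()/str.lower() are used only as a convenient source of full case
mappings:

```python
def single_char_variants(ch):
    """Case variants of ch that are exactly one code point long."""
    return {c for c in (ch, ch.upper(), ch.lower()) if len(c) == 1}

def match_ci(pattern, text):
    """Literal case-insensitive match, one pattern character per text character."""
    if len(pattern) != len(text):
        return False
    return all(t in single_char_variants(p) for p, t in zip(pattern, text))

print("ß".upper())                     # the full mapping is two characters long
print(match_ci("strasse", "STRASSE"))  # every position has a one-to-one variant
print(match_ci("straße", "STRASSE"))   # ß would need to match two characters
```

The two-character result of "ß".upper() is dropped by
single_char_variants(), so no per-character alternative can ever cover
"SS".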

Σ->σ/ς does make sense, and what we have seems to be just broken:

select 'ς' ~* 'Σ'; -- false in both libc and ICU
select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

select 'ǅ' ~* 'ǆ'; -- false in libc and ICU
select 'ǆ' ~* 'ǅ'; -- true in libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.
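
A rough Python sketch of the difference, under stated assumptions: it is
not the allcases() API, the function names are hypothetical, and
str.casefold() is used as a stand-in for a builtin case mapping table.
upper_lower_variants() approximates looking only at towupper/towlower,
while all_case_variants() groups every code point that shares the same
single-character case folding:

```python
import sys
from collections import defaultdict

def upper_lower_variants(ch):
    """Variants reachable via upper()/lower() alone (roughly towupper/towlower)."""
    return {c for c in (ch, ch.upper(), ch.lower()) if len(c) == 1}

def build_case_classes():
    """Group code points that share a single-character case folding."""
    classes = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        folded = ch.casefold()
        # Only record characters that actually change and fold to one code point.
        if len(folded) == 1 and folded != ch:
            classes[folded].update((ch, folded))
    return classes

CASE_CLASSES = build_case_classes()

def all_case_variants(ch):
    """Every single-character case variant of ch, independent of direction."""
    folded = ch.casefold()
    key = folded if len(folded) == 1 else ch
    return set(CASE_CLASSES.get(key, {ch}))
```

Here upper_lower_variants("Σ") does not contain "ς" but
upper_lower_variants("ς") does contain "Σ", mirroring the asymmetric
results above. all_case_variants() returns the same class for Σ, σ and ς,
and likewise for Ǆ, ǅ and ǆ, so a match would no longer depend on which
variant appears in the pattern.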

Regards,
Jeff Davis

In response to

Re: encoding affects ICU regex character classification at 2023-12-09 21:39:37 from Thomas Munro

Responses

Re: encoding affects ICU regex character classification at 2023-12-12 22:35:57 from Jeremy Schneider
