From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: encoding affects ICU regex character classification |
Date: | 2023-12-12 21:39:55 |
Message-ID: | 03959b5f8b37b6126d0b9c6ac16c960a94fcd3bb.camel@j-davis.com |
Lists: | pgsql-hackers |
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
>
> How would you specify what you want?
One proposal would be to have a builtin collation provider:
https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as
provider-specific options given at CREATE COLLATION time.
> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.
Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.
> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?
We can do built-in case mapping, see:
https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
> It seems a bit strange to use different systems for
> classification and mapping. If you do implement mapping too, you
> have to decide if you believe it is language-dependent or not, I
> think?
A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.
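To illustrate how small the language-dependent part can be, here is a
minimal sketch in Python (not PostgreSQL code). It assumes a default
mapping plus a per-language override table, and it covers only the
Turkic dotted/dotless i pairs for "tr" and "az"; the Lithuanian rules
and the combining-dot cases are deliberately left out. The names
TURKIC_LANGS, TURKIC_LOWER, TURKIC_UPPER, to_lower and to_upper are
hypothetical, and Python's str.lower()/str.upper() stand in for the
default mapping.

```python
# Hypothetical override tables: only the Turkic dotted/dotless i pairs.
TURKIC_LANGS = {"tr", "az"}
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> dotless i, dotted I -> i
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> dotted I, dotless i -> I


def to_lower(text, lang=""):
    """Lowercase text, applying the Turkic overrides before the default mapping."""
    if lang in TURKIC_LANGS:
        text = "".join(TURKIC_LOWER.get(ch, ch) for ch in text)
    return text.lower()


def to_upper(text, lang=""):
    """Uppercase text, applying the Turkic overrides before the default mapping."""
    if lang in TURKIC_LANGS:
        text = "".join(TURKIC_UPPER.get(ch, ch) for ch in text)
    return text.upper()
```

With these definitions, to_lower("ISTANBUL", "tr") yields a leading
dotless ı, while to_lower("ISTANBUL") gives the ordinary "istanbul".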
> Hmm, let's see what we're doing now... for ICU the regex code is
> using "simple" case mapping functions like u_toupper(c) that don't
> take a locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to
> the comments at the top. I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)
Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.
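A small Python sketch of why ß->SS falls outside a per-character
scheme (an illustration only, not the regex code): if a matcher can
only substitute one character for one character, any mapping whose
result is longer than one character has to be dropped. The helper
name single_char_variants is hypothetical, and Python's full case
mapping stands in for the provider's.

```python
def single_char_variants(ch):
    """Return the case variants of ch that are themselves a single character.

    A character-by-character matcher can only use these; multi-character
    results such as the uppercase of sharp s are discarded.
    """
    candidates = {ch, ch.upper(), ch.lower()}
    return {c for c in candidates if len(c) == 1}


sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
print(len(sharp_s.upper()))           # length of the uppercase form
print(single_char_variants(sharp_s))  # only the character itself survives
print(single_char_variants("a"))      # an ordinary letter keeps both cases
```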
Σ->σ/ς does make sense, and what we have seems to be just broken:
select 'ς' ~* 'Σ'; -- false in both libc and ICU
select 'Σ' ~* 'ς'; -- true in both libc and ICU
Similarly for titlecase variants:
select 'Dž' ~* 'dž'; -- false in libc and ICU
select 'dž' ~* 'Dž'; -- true in libc and ICU
If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.
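The asymmetry above can be modelled in a few lines of Python. This is
a sketch under two assumptions, not a description of the actual
implementation: that the current code expands a pattern character to
itself plus its single upper and lower images, and that a fuller
allcases() would return every character sharing the same case
folding. The names simple_variants, build_fold_classes and
all_variants are hypothetical; Python's str.casefold() stands in for
a built-in case-folding table.

```python
import sys


def simple_variants(ch):
    """Variants reachable with one upper and one lower mapping of ch."""
    candidates = {ch, ch.upper(), ch.lower()}
    return {c for c in candidates if len(c) == 1}


def build_fold_classes():
    """Group characters by case folding, storing only non-trivial classes."""
    classes = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        folded = ch.casefold()
        if folded != ch:
            classes.setdefault(folded, {folded}).add(ch)
    return classes


FOLD_CLASSES = build_fold_classes()


def all_variants(ch):
    """Every single character that case-folds to the same string as ch."""
    members = FOLD_CLASSES.get(ch.casefold(), {ch})
    return {c for c in members if len(c) == 1}


CAP_SIGMA, SMALL_SIGMA, FINAL_SIGMA = "\u03a3", "\u03c3", "\u03c2"
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01c4", "\u01c5", "\u01c6"

# Pattern 'Σ' expands without the final sigma, so it cannot match 'ς'.
print(FINAL_SIGMA in simple_variants(CAP_SIGMA))
# Pattern 'ς' expands to include 'Σ', so the reverse direction matches.
print(CAP_SIGMA in simple_variants(FINAL_SIGMA))
# The folding-based set contains all three sigmas and all three dž forms.
print(all_variants(CAP_SIGMA) >= {CAP_SIGMA, SMALL_SIGMA, FINAL_SIGMA})
print(all_variants(DZ_LOWER) >= {DZ_UPPER, DZ_TITLE, DZ_LOWER})
```

Under this model the pattern side decides the outcome: a lowercase or
uppercase pattern never reaches the titlecase or final-sigma form,
while starting from that form reaches both of the others.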
Regards,
Jeff Davis