From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Daniel Verite <daniel(at)manitou-mail(dot)org> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Built-in CTYPE provider |
Date: | 2024-03-27 17:40:19 |
Message-ID: | 2f404017690b43e6951cd4a60798c3f9626bbe56.camel@j-davis.com |
Lists: | pgsql-hackers |
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
> provider | isalpha | isdigit
> ----------+---------+---------
> ICU | f | t
> glibc | t | f
> builtin | f | f
The "ICU" row above is really the behavior of the Postgres ICU provider as
we implemented it; it's not something forced on us by ICU.
For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().
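As a rough illustration of why that definition gives "f | t" for U+FF11, here is a minimal sketch that classifies by Unicode general category, which is how u_isalpha() (categories "L") and u_isdigit() (category "Nd") are documented. It uses Python's unicodedata module rather than ICU itself, and the helper names are made up for this example, so treat it as an approximation and not as the provider's code.

```python
import unicodedata

# General categories treated as letters (Lu, Ll, Lt, Lm, Lo), mirroring
# the documented definition of u_isalpha() / Character.isLetter().
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def isalpha_by_category(ch: str) -> bool:
    """Hypothetical helper: letter test based on general category only."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def isdigit_by_category(ch: str) -> bool:
    """Hypothetical helper: digit test based on general category Nd only."""
    return unicodedata.category(ch) == "Nd"


ch = "\uff11"  # FULLWIDTH DIGIT ONE
print(unicodedata.name(ch), unicodedata.category(ch))
print("isalpha:", isalpha_by_category(ch), "isdigit:", isdigit_by_category(ch))
```

U+FF11 has general category Nd, so a category-based letter test rejects it and a category-based digit test accepts it.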
ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".
(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)
The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard".
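To make the distinction concrete for the "digit" class, here is a small sketch of the two variants as I read them in [4]: "Standard" is general category Decimal_Number, and "POSIX Compatible" is just the ASCII range 0-9. The function names are invented for this example and this is not the code from the Postgres tree. The "alpha" class is left out because it depends on the Alphabetic property, which Python's standard library does not expose.

```python
import unicodedata


def isdigit_standard(ch: str) -> bool:
    """Hypothetical helper: 'Standard' digit, general category Nd."""
    return unicodedata.category(ch) == "Nd"


def isdigit_posix_compatible(ch: str) -> bool:
    """Hypothetical helper: 'POSIX Compatible' digit, ASCII 0-9 only."""
    return "0" <= ch <= "9"


for ch in ("7", "\uff11"):
    print(
        f"U+{ord(ch):04X}",
        "standard:", isdigit_standard(ch),
        "posix:", isdigit_posix_compatible(ch),
    )
```

The two variants agree on ASCII digits and disagree on U+FF11, which matches the isdigit column for ICU and builtin in the table above.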
I implemented both the "Standard" and "POSIX Compatible" compatibility
properties in ad49994538, so it would be easy to change what PG_C_UTF8
uses.
[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4]
http://www.unicode.org/reports/tr18/#Compatibility_Properties
> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is
> not digit, but it's alpha)?
Yes, some differences are to be expected.
But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as
we can point to something other than "glibc version 2.35 does it this
way".
Regards,
Jeff Davis