From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Jeff Davis" <pgsql(at)j-davis(dot)com> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Built-in CTYPE provider |
Date: | 2024-03-27 15:53:33 |
Message-ID: | 610d7f1b-c68c-4eb8-a03d-1515da304c58@manitou-mail.org |
Lists: | pgsql-hackers |
Jeff Davis wrote:
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.
Even aside from ICU, glibc and pg_c_utf8 behave differently for
codepoints in the decimal digit category outside of the US-ASCII
range '0'..'9':
select initcap(concat(chr(0xff11), 'a') collate "C.utf8"); -- glibc 2.35
initcap
---------
1a
select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
initcap
---------
1A
Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.
AFAIU this is why in the above initcap() call, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.
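To make that reasoning concrete, here is a rough Python sketch of the "uppercase after a non-alphanumeric, lowercase otherwise" logic. It is only a model under stated assumptions, not the server's C code: the names `initcap`, `unicode_alnum` and `posix_digit_alnum` are hypothetical, and the two predicates merely approximate "U+FF11 counts as alphanumeric" versus "only ASCII '0'..'9' count as digits".

```python
def initcap(text, is_alnum):
    """Uppercase a character that follows a non-alphanumeric one,
    lowercase a character that follows an alphanumeric one."""
    out = []
    was_alnum = False
    for ch in text:
        out.append(ch.lower() if was_alnum else ch.upper())
        was_alnum = is_alnum(ch)
    return "".join(out)


def unicode_alnum(ch):
    # Assumption: models a provider for which U+FF11 is alphanumeric.
    return ch.isalnum()


def posix_digit_alnum(ch):
    # Assumption: models a provider where only ASCII digits are digits,
    # so U+FF11 is neither digit nor alpha.
    return ch.isalpha() or ch in "0123456789"


s = chr(0xFF11) + "a"
print(initcap(s, unicode_alnum))      # 'a' stays lowercase
print(initcap(s, posix_digit_alnum))  # 'a' becomes 'A'
```

With the first predicate the fullwidth digit starts the word, so 'a' is lowercased; with the second, 'a' is the first alphanumeric and gets capitalized.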
Comparing the 3 providers:
WITH v(provider,type,result) AS (values
('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
)
select * from v
\crosstabview
 provider | isalpha | isdigit
----------+---------+---------
 ICU      | f       | t
 glibc    | t       | f
 builtin  | f       | f
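For reference, the Unicode Character Database properties of U+FF11 can be checked outside the database. The snippet below uses Python's `unicodedata` module, which is an independent implementation and not one of the three providers above, so it only shows what the Unicode data itself says about this codepoint.

```python
import unicodedata

ch = chr(0xFF11)

# Properties taken from the Unicode Character Database bundled with Python.
name = unicodedata.name(ch)          # character name
category = unicodedata.category(ch)  # general category
value = unicodedata.decimal(ch)      # decimal digit value

print(name, category, value)
print("isalpha:", ch.isalpha(), "isdecimal:", ch.isdecimal())
```

The general category is Nd (decimal digit), which is consistent with the ICU row of the table.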
Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is digit and not alpha) and glibc's point of view (U+ff11 is
not digit, but it's alpha)?
Aside from initcap(), this is going to be significant for regular
expressions.
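As an analogy only (this is Python's `re` module, not PostgreSQL's regex engine), the same kind of split shows up when a digit class is evaluated with Unicode rules versus ASCII-only rules:

```python
import re

ch = chr(0xFF11)  # FULLWIDTH DIGIT ONE

# Default for str patterns: \d matches any Unicode decimal digit.
unicode_match = re.fullmatch(r"\d", ch) is not None

# With re.ASCII, \d is restricted to [0-9].
ascii_match = re.fullmatch(r"\d", ch, flags=re.ASCII) is not None

print(unicode_match, ascii_match)
```

A pattern that validates "digits only" input would accept or reject the same string depending on which definition is in effect.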
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite