From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC |
Date: | 2025-04-17 13:58:41 |
Message-ID: | 20250417135841.33.nmisch@google.com |
Lists: | pgsql-hackers |
On Fri, Jan 17, 2025 at 04:06:20PM -0800, Jeff Davis wrote:
> Committed 0001 and 0002.
> Upon reviewing the discussion threads, I removed the Unicode "adjust to
> Cased" behavior when titlecasing. As Peter pointed out[1], it doesn't
> match the documentation or expectations for INITCAP().
While commit d3d0983 changed most of the non-test pg_u_*() "bool posix"
arguments, it left a pg_u_isalnum(u, true) in strtitle_builtin() subroutine
initcap_wbnext(). The above paragraph may or may not be saying that's
intentional. Example of the consequence at non-ASCII decimal digits:
SELECT
    str,
    re,
    regexp_count(str COLLATE pg_c_utf8, re) AS count_c_utf8,
    regexp_count(str COLLATE pg_unicode_fast, re) AS count_unicode_fast,
    regexp_count(str COLLATE unicode, re) AS count_unicode,
    initcap(str COLLATE pg_c_utf8) AS initcap_c_utf8,
    initcap(str COLLATE pg_unicode_fast) AS initcap_unicode_fast,
    initcap(str COLLATE unicode) AS initcap_unicode
FROM
    (VALUES (U&'foo\0661bar baz')) AS str_t(str),
    (VALUES ('[[:digit:]]')) AS re_t(re)
ORDER BY 1, 2;
str │ foo١bar baz
re │ [[:digit:]]
count_c_utf8 │ 0
count_unicode_fast │ 1
count_unicode │ 1
initcap_c_utf8 │ Foo١Bar Baz
initcap_unicode_fast │ Foo١Bar Baz
initcap_unicode │ Foo١bar Baz
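To make the divergence above concrete, here is a rough Python sketch of how a word-boundary test that treats only ASCII digits as digits ("posix" mode) differs from one that accepts any Unicode decimal digit. It is an illustration only, not the PostgreSQL implementation: is_alnum() and initcap_sketch() are hypothetical names, Python's str.isalpha() only approximates the Unicode Alphabetic property, and upper()/lower() stand in for real titlecase mapping.

```python
import unicodedata


def is_alnum(ch: str, posix: bool) -> bool:
    """Hypothetical analogue of an alnum test with a "bool posix" flag."""
    if ch.isalpha():
        return True
    if posix:
        # POSIX-compatible mode: only ASCII 0-9 count as digits.
        return "0" <= ch <= "9"
    # Standard mode: any character in general category Nd is a digit.
    return unicodedata.category(ch) == "Nd"


def initcap_sketch(s: str, posix: bool) -> str:
    """Uppercase the first alnum character of each word, lowercase the rest."""
    out = []
    at_word_start = True
    for ch in s:
        if is_alnum(ch, posix):
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
        else:
            # A non-alnum character ends the current word.
            out.append(ch)
            at_word_start = True
    return "".join(out)


sample = "foo\u0661bar baz"  # U+0661 is ARABIC-INDIC DIGIT ONE
posix_result = initcap_sketch(sample, posix=True)
standard_result = initcap_sketch(sample, posix=False)
```

With posix=True the Arabic-Indic digit acts as a word break, so "bar" is capitalized, as in the initcap_c_utf8 and initcap_unicode_fast columns. With posix=False the digit is part of the word and "bar" stays lowercase, as in the initcap_unicode column.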
Should initcap_wbnext() pass in a locale-dependent "bool posix" argument like
the other calls the commit changed? Related message from the development of
pg_c_utf8, which you shared downthread:
https://www.postgresql.org/message-id/610d7f1b-c68c-4eb8-a03d-1515da304c58%40manitou-mail.org
Long-term, pg_u_isword() should have a "bool posix" argument. Currently, only
tests call that function. If it got a non-test caller,
https://www.unicode.org/reports/tr18/#word would have pg_u_isword() follow the
choice of posix compatibility like pg_u_isalnum() does.