Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
Date: 2025-04-17 13:58:41
Message-ID: 20250417135841.33.nmisch@google.com
Lists: pgsql-hackers

On Fri, Jan 17, 2025 at 04:06:20PM -0800, Jeff Davis wrote:
> Committed 0001 and 0002.

> Upon reviewing the discussion threads, I removed the Unicode "adjust to
> Cased" behavior when titlecasing. As Peter pointed out[1], it doesn't
> match the documentation or expectations for INITCAP().

While commit d3d0983 changed most of the non-test pg_u_*() "bool posix"
arguments, it left a pg_u_isalnum(u, true) call in initcap_wbnext(), a
subroutine of strtitle_builtin(). The above paragraph may or may not be saying
that's intentional. Example of the consequence at non-ASCII decimal digits:

SELECT
    str,
    re,
    regexp_count(str COLLATE pg_c_utf8, re) AS count_c_utf8,
    regexp_count(str COLLATE pg_unicode_fast, re) AS count_unicode_fast,
    regexp_count(str COLLATE unicode, re) AS count_unicode,
    initcap(str COLLATE pg_c_utf8) AS initcap_c_utf8,
    initcap(str COLLATE pg_unicode_fast) AS initcap_unicode_fast,
    initcap(str COLLATE unicode) AS initcap_unicode
FROM
    (VALUES (U&'foo\0661bar baz')) AS str_t(str),
    (VALUES ('[[:digit:]]')) AS re_t(re)
ORDER BY 1, 2;

str │ foo١bar baz
re │ [[:digit:]]
count_c_utf8 │ 0
count_unicode_fast │ 1
count_unicode │ 1
initcap_c_utf8 │ Foo١Bar Baz
initcap_unicode_fast │ Foo١Bar Baz
initcap_unicode │ Foo١bar Baz
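
To make the mechanism concrete, here is a rough Python sketch of how a fixed posix=true alnum test moves the word boundary. It is not the PostgreSQL C code: is_digit(), is_alnum(), and initcap_sketch() are hypothetical names, str.isalpha() only approximates the Unicode Alphabetic property, and the word-boundary rule is simplified to "a run of alnum characters is one word".

```python
import unicodedata


def is_digit(ch: str, posix: bool) -> bool:
    """POSIX-compatible digit is ASCII 0-9 only; otherwise any Decimal_Number (Nd)."""
    if posix:
        return "0" <= ch <= "9"
    return unicodedata.category(ch) == "Nd"


def is_alnum(ch: str, posix: bool) -> bool:
    # Assumption: str.isalpha() stands in for the Unicode Alphabetic property.
    return ch.isalpha() or is_digit(ch, posix)


def initcap_sketch(s: str, posix: bool) -> str:
    """Uppercase the first alnum character of each alnum run, lowercase the rest."""
    out = []
    in_word = False
    for ch in s:
        if is_alnum(ch, posix):
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # Any non-alnum character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


s = "foo\u0661bar baz"  # U+0661 is ARABIC-INDIC DIGIT ONE
print(initcap_sketch(s, posix=True))   # Foo١Bar Baz
print(initcap_sketch(s, posix=False))  # Foo١bar Baz
```

With posix=True, U+0661 is not alnum, so it ends the word and "bar" is capitalized, as in the pg_c_utf8 and pg_unicode_fast rows above. With posix=False, the digit is part of the word, as in the "unicode" row.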

Should initcap_wbnext() pass in a locale-dependent "bool posix" argument like
the other calls the commit changed? Related message from the development of
pg_c_utf8, which you shared downthread:
https://www.postgresql.org/message-id/610d7f1b-c68c-4eb8-a03d-1515da304c58%40manitou-mail.org

Long-term, pg_u_isword() should have a "bool posix" argument. Currently, only
tests call that function. If it gained a non-test caller, the definition at
https://www.unicode.org/reports/tr18/#word would have pg_u_isword() follow the
choice of POSIX compatibility, as pg_u_isalnum() does.
