Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC
Date: 2024-12-16 20:49:14
Message-ID: 179d2b9eb62ce9584177f4c863174c94de9985b7.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2024-12-11 at 15:52 -0800, Jeff Davis wrote:
> Attached is a series of patches to implement full case mapping as the
> locale PG_UNICODE_FAST.

Rebased and attached.

I have a doubt about correctness, though. There's a statement in
SpecialCasing.txt:

# IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
# the result will be incorrect unless the iota-subscript is moved to the end
# of any sequence of combining marks. Otherwise, the accents will go on the capital iota.
# This process can be achieved by first transforming the text to NFC before casing.
# E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>

That requirement doesn't appear to exist in the Unicode standard
itself, nor is it implied by the mappings in the data files. And
based on the description, it only matters if the input is not
normalized to NFC (I believe NFD is also fine, because the combining
class of U+0345 is 240, higher than any other class, so canonical
ordering already places it after all other combining marks).
Furthermore, it appears that the ICU root collation doesn't bother
trying to implement that requirement for non-normalized input.
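
To make the ordering issue concrete, here is a minimal sketch that uses
Python's built-in str.upper() and the unicodedata module as a stand-in
for a context-free full case mapping. It is not the code from the
attached patches, and the helper names (codepoints, upper_after) are
hypothetical. It uppercases <alpha><iota_subscript><acute> with and
without normalizing first, then compares the results in NFD.

```python
import unicodedata

# alpha + iota-subscript (U+0345, ccc 240) + acute (U+0301, ccc 230).
# The marks are deliberately NOT in canonical order.
UNNORMALIZED = "\u03b1\u0345\u0301"


def codepoints(s):
    """Render a string as a list of U+XXXX labels for easy comparison."""
    return ["U+%04X" % ord(ch) for ch in s]


def upper_after(form, s):
    """Uppercase s, optionally normalizing to `form` first.

    `form` is "NFC", "NFD", or None for no normalization. The result is
    returned in NFD so that canonically equivalent outputs compare equal.
    """
    if form is not None:
        s = unicodedata.normalize(form, s)
    return unicodedata.normalize("NFD", s.upper())


if __name__ == "__main__":
    for form in (None, "NFC", "NFD"):
        print(form, codepoints(upper_after(form, UNNORMALIZED)))
```

Without normalization the acute ends up after the capital iota
(U+0391 U+0399 U+0301), which is the incorrect result the comment in
SpecialCasing.txt warns about. With either NFC or NFD applied first,
the acute stays on the capital alpha (U+0391 U+0301 U+0399).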

There is a related requirement for caseless matching in the Unicode
standard[1] that requires normalization iff the source includes U+0345
(or any character which has U+0345 in its decomposition). The ICU
u_strFoldCase() function doesn't do that, either.
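
The caseless-matching rule can be sketched the same way. The example
below assumes that Python's str.casefold() is an acceptable stand-in
for toCasefold(); it is not ICU's u_strFoldCase() and not the patch's
implementation, and the function names are hypothetical. It compares
two canonically equivalent strings, first by folding alone and then
following the shape of D145, NFD(toCasefold(NFD(X))).

```python
import unicodedata

# Two canonically equivalent spellings of alpha + iota-subscript + acute:
# X has the marks out of canonical order, Y is its NFD form.
X = "\u03b1\u0345\u0301"
Y = unicodedata.normalize("NFD", X)


def naive_caseless_key(s):
    """Case-fold only, with no normalization before or after."""
    return s.casefold()


def canonical_caseless_key(s):
    """Key shaped like D145: NFD(toCasefold(NFD(s)))."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


if __name__ == "__main__":
    print("naive match:    ", naive_caseless_key(X) == naive_caseless_key(Y))
    print("canonical match:", canonical_caseless_key(X) == canonical_caseless_key(Y))
```

Folding turns U+0345 into U+03B9, which is a starter, so any marks that
followed the iota-subscript in the input stay attached to the folded
iota and can no longer be reordered. That is why the normalization has
to happen before folding, not only after it.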

I'm not sure how important these requirements are, but I'm bringing
them up now because we can't change the case mapping behavior after
release, and the current behavior may be technically incorrect for
non-normalized input.

Regards,
Jeff Davis

[1] Unicode 16.0 section 3.13.5 rule D145
https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf#G34145

Attachment                                                        Content-Type  Size
v2-0001-Support-Unicode-full-case-mapping-and-conversion.patch    text/x-patch  559.2 KB
v2-0002-Support-PG_UNICODE_FAST-locale-in-the-builtin-col.patch   text/x-patch  18.7 KB
v2-0003-Change-UCS_BASIC-to-use-the-builtin-PG_UNICODE_FA.patch   text/x-patch  1.7 KB
