Quick Links

Re: encoding affects ICU regex character classification

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: encoding affects ICU regex character classification
Date:	2023-12-01 20:49:46
Message-ID:	40bb6ee11ab2679024d625224f026f9e198cc7f7.camel@j-davis.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, 2023-11-30 at 15:10 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis <pgsql(at)j-davis(dot)com>
> > wrote:
> > > > Character classification is not localized at all in libc or ICU
> > > > as > > far
> > > > as I can tell.
> >
> > Really? POSIX isalpha()/isalpha_l() and friends clearly depend on
> > a
> > locale. See eg d522b05c for a case where that broke something.

I believe we're using different definitions of "localized". What I mean
is "varies from region to region or language to language". I think you
mean "varies for any reason at all [perhaps for no reason?]".

For instance, that commit indirectly links to:

https://github.com/evanj/isspace_locale

Which says "Mac OS X in a UTF-8 locale...". I don't see any fundamental
locale-based concern there.

I wrote a test program (attached) which compares any two given libc
locales using both the ordinary isalpha() family of functions, and also
using the iswalpha() family of functions. For the former, I only test
up to 0x7f. For the latter, I went to some effort to properly translate
the code point to a wchar_t (encode as UTF8, then mbstowcs using a UTF-
8 locale), and I test all unicode code points except the surrogate
range.

Using the test program, I compared the C.UTF-8 locale to every other
installed locale on my system (attached list for reference) and the
only ones that show any differences are "C" and "POSIX". That, combined
with the fact that ICU doesn't even accept a locale argument to the
character classification functions, gives me a high degree of
confidence that character classification is not localized on my system
according to my definition of "localized". If someone else wants to run
the test program on their system, I'd be interested to see the results
(some platform-specific modification may be required, e.g. handling 16-
bit whcar_t, etc.).

Your definition is too wide in my opinion, because it mixes together
different sources of variation that are best left separate:
a. region/language
b. technical requirements
c. versioning
d. implementation variance

(a) is not a true source of variation (please correct me if I'm wrong)

(b) is perhaps interesting. The "C" locale is one example, and perhaps
there are others, but I doubt very many others that we want to support.

(c) is not a major concern in my opinion. The impact of Unicode changes
is usually not dramatic, and it only affects regexes so it's much more
contained than collation, for example. And if you really care, just use
the "C" locale.

(d) is mostly a bug. Most users would prefer standardization, platform-
independence, documentability, and testability. There are users who
might care a lot about compatibility, and I don't want to disrupt such
users, but long term I don't see a lot of value in bubbling up
semantics from libc into expressions when there's not a clear reason to
do so. (Note: getting semantics from libc is a bit dubious in the case
of collation, as well, but at least for collation there are regional
and linguistic differences that we can't handle internally.)

I think we only need 2 main character classification schemes: "C" and
Unicode (TR #18 Compatibility Properties[1], either the "Standard"
variant or the "POSIX Compatible" variant or both). The libc and ICU
ones should be there only for compatibility and discouraged and
hopefully eventually removed.

> > Not knowing anything about how glibc generates its charmaps,
> > Unicode
> > or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
> > have an old hand-crafted table, but for UTF-8 encoding it's fully
> > outsourced to Unicode, and that's why you see a difference.

No, the problem is that we're passing a pg_wchar to an ICU function
that expects a 32-bit code point. Those two things are equivalent in
the UTF8 encoding, but not in the LATIN9 encoding.

See the comment at the top of regc_pg_locale.c, which should probably
be updated to describe what happens with ICU collations.

> > Another
> > problem seen in a few parts of our tree is that we sometimes feed
> > individual UTF-8 bytes to the isXXX() functions which is about as >
> > well
> > defined as trying to pay for a pint with the left half of a $10
> > bill.

If we have built-in character classification systems as I propose ("C"
and Unicode), then the callers can simply choose which well-defined one
to use.

> > also
> > might disagree with tables built by PostgreSQL.

The patch I provided (new version attached) exhaustively tests all the
new Unicode property tables, and also the class assignments based on
[1] and [2]. Everything is identical for all assigned code points. The
test will run whenever you "ninja update-unicode", so any
inconsistencies will be highly visible before release. Additionally,
because the tables are checked in, you'll be able to see (in the diff)
the impact from a Unicode version update and consider that impact when
writing the release notes.

You may be wondering about differences in the version of Unicode
between Postgres and ICU while the test is running. It only tests code
points that are assigned in both Unicode versions, and reports the
number of code points that are skipped due to this check. The person
running "update-unicode" may see a failing test or a large number of
skipped codepoints if the Unicode versions don't match, in which case
they should try running against a more closely-matching version of ICU.

Regards,
Jeff Davis

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details

Attachment	Content-Type	Size
l.txt	text/plain	702 bytes
ctype_test.c	text/x-csrc	6.8 KB
v2-0003-Add-Unicode-property-tables.patch	text/x-patch	91.5 KB
v2-0002-Shrink-unicode-category-table.patch	text/x-patch	101.7 KB
v2-0001-Minor-cleanup-for-unicode-update-build-and-test.patch	text/x-patch	7.4 KB

In response to

Re: encoding affects ICU regex character classification at 2023-11-30 02:10:40 from Thomas Munro

Responses

Re: encoding affects ICU regex character classification at 2023-12-09 21:39:37 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tristan Partin	2023-12-01 21:11:49	Re: Refactoring backend fork+exec code
Previous Message	Heikki Linnakangas	2023-12-01 20:44:08	Re: Refactoring backend fork+exec code