Re: encoding affects ICU regex character classification

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: encoding affects ICU regex character classification
Date:	2023-11-30 00:23:22
Message-ID:	f76b728ca0a8e63fb51acc8c1fbed141ce2fdbb3.camel@j-davis.com
Lists:	pgsql-hackers

On Wed, 2023-11-29 at 18:56 -0500, Tom Lane wrote:
> We'd have to find an alternate source of knowledge to replace the
> <wctype.h> functions if we wanted to fix it fully ... can ICU do
> that?

My follow-up proposal is exactly along those lines, except that we
don't even need ICU.

By adding a couple of lookup tables generated from the Unicode data files,
we can offer a pg_u_isalpha() family of functions. As a bonus, I have
some exhaustive tests to compare with what ICU does so we can protect
ourselves from simple mistakes.
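
To make the lookup-table idea concrete, here is a minimal sketch of how such a classifier could work: a sorted table of inclusive code point ranges plus a binary search. This is not the code from the attached patch. The names cp_range, alpha_ranges_sketch and is_alpha_sketch are hypothetical, and the table is a small hand-picked excerpt of the Unicode Alphabetic property, not generated output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range entry: an inclusive [first, last] code point interval. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} cp_range;

/*
 * Illustrative excerpt only, sorted and non-overlapping. A real table would
 * be generated from the Unicode data files and cover the whole code space.
 */
static const cp_range alpha_ranges_sketch[] = {
	{0x0041, 0x005A},			/* A-Z */
	{0x0061, 0x007A},			/* a-z */
	{0x00AA, 0x00AA},			/* FEMININE ORDINAL INDICATOR */
	{0x00B5, 0x00B5},			/* MICRO SIGN */
	{0x00BA, 0x00BA},			/* MASCULINE ORDINAL INDICATOR */
	{0x00C0, 0x00D6},			/* Latin-1 letters before U+00D7 */
	{0x00D8, 0x00F6},			/* Latin-1 letters before U+00F7 */
	{0x00F8, 0x02C1},			/* Latin extended, IPA, modifier letters */
};

/* Binary search: true if cp falls inside any range of the sorted table. */
static bool
range_search(const cp_range *tbl, size_t n, uint32_t cp)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cp < tbl[mid].first)
			hi = mid;
		else if (cp > tbl[mid].last)
			lo = mid + 1;
		else
			return true;
	}
	return false;
}

/* Hypothetical stand-in for a pg_u_isalpha()-style function. */
bool
is_alpha_sketch(uint32_t cp)
{
	size_t		n = sizeof(alpha_ranges_sketch) / sizeof(alpha_ranges_sketch[0]);

	return range_search(alpha_ranges_sketch, n, cp);
}
```

With this excerpt, is_alpha_sketch(0x00E9) (é) returns true while is_alpha_sketch(0x00D7) (the multiplication sign) returns false. The result depends only on the code point, not on the database encoding or the libc locale. An exhaustive test can loop over every code point and compare the answer with another implementation such as ICU.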

I might as well send it now; patch attached (0003 is the interesting
one).

I also tested against the iswalpha() family of functions, and those
have very similar behavior (apart from the "C" locale, of course).
Character classification is not localized at all in libc or ICU as far
as I can tell.

There are some differences, and I don't understand why those
differences exist, so perhaps that's worth discussing. Some differences
seem to be related to the titlecase/uppercase distinction. Others are
strange, like how glibc counts some digit characters (outside 0-9) as
alphabetic. And some seem arbitrary, like excluding a few whitespace
characters. I can try to post more details if that would be helpful.

Another issue is that right now we are doing the wrong thing with ICU:
we should be using the u_isUAlphabetic() family of functions, not the
u_isalpha() family of functions.

Regards,
Jeff Davis

Attachment	Content-Type	Size
v1-0003-Add-Unicode-property-tables.patch	text/x-patch	85.8 KB
v1-0002-Shrink-unicode-category-table.patch	text/x-patch	101.7 KB
v1-0001-Minor-cleanup-for-unicode-update-build-and-test.patch	text/x-patch	7.4 KB

In response to

Re: encoding affects ICU regex character classification at 2023-11-29 23:56:04 from Tom Lane

Responses

Re: encoding affects ICU regex character classification at 2023-11-30 02:10:40 from Thomas Munro
