Re: Pre-proposal: unicode normalized text

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pre-proposal: unicode normalized text
Date:	2023-10-17 03:32:19
Message-ID:	c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com
Lists:	pgsql-hackers

On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> We need to be careful about precise terminology. "Valid" has a defined
> meaning for Unicode. A byte sequence can be valid or not as UTF-8. But
> a string containing unassigned code points is not not-"valid" as
> Unicode.

New patch attached, function name is "unicode_assigned".
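
For readers without the patch at hand, the idea behind an "assigned" check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patch's C implementation: the Python function name below merely mirrors the SQL name, and the result depends on the Unicode version bundled with the Python build rather than the one PostgreSQL would ship. It treats a code point as unassigned when its General Category is Cn.

```python
import unicodedata


def unicode_assigned(text: str) -> bool:
    """Return True if every code point in `text` is assigned.

    Illustrative only: "assigned" here means the General Category is
    not "Cn" in the Unicode version that this Python build carries
    (see unicodedata.unidata_version).
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


if __name__ == "__main__":
    print(unicodedata.unidata_version)
    print(unicode_assigned("abc"))      # all assigned
    print(unicode_assigned("a\u0378"))  # U+0378 is a reserved code point
```

Because the answer is tied to a specific Unicode version, a string that fails the check today could pass after an upgrade, but an assigned code point never becomes unassigned.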

I believe the patch has utility as-is, but I've been brainstorming a
few more ideas that could build on it:

* Add a per-database option to enforce only storing assigned Unicode
code points.

* (More radical) Add a per-database option to normalize all text in
NFC.

* Do character classification in Unicode rather than relying on
glibc/ICU. This would affect regex character classes, etc., but not
affect upper/lower/initcap nor collation. I did some experiments and
the General Category doesn't change a lot: a total of 197 characters
changed their General Category since Unicode 6.0.0, and only 5 since
ICU 11.0.0. I'm not quite sure how to expose this, but it seems like a
nicer way to handle it than tying it into the collation provider.
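
As a rough illustration of the last two ideas, the following Python sketch shows NFC normalization and a character-class test driven purely by the Unicode General Category. The helper names `to_nfc` and `is_letter_by_category` are hypothetical and exist only for this example; nothing here reflects how PostgreSQL would expose either feature, and the results follow the Unicode version bundled with Python.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize text to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def is_letter_by_category(ch: str) -> bool:
    """Classify a single character as a letter using only its
    General Category (Lu, Ll, Lt, Lm, Lo), with no locale or
    collation provider involved."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    decomposed = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
    composed = to_nfc(decomposed)
    print(len(decomposed), len(composed))
    print(is_letter_by_category(composed), is_letter_by_category("1"))
```

The classification helper depends only on the Unicode Character Database, which is the property that makes it independent of glibc or ICU behavior.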

Regards,
Jeff Davis

Attachment	Content-Type	Size
v3-0001-Additional-unicode-primitive-functions.patch	text/x-patch	214.1 KB

In response to

Re: Pre-proposal: unicode normalized text at 2023-10-11 06:56:13 from Peter Eisentraut

Responses

Re: Pre-proposal: unicode normalized text at 2023-10-17 15:07:40 from Daniel Verite
Re: Pre-proposal: unicode normalized text at 2023-10-27 21:15:00 from Jeff Davis
