| From: | Jeff Davis <pgsql(at)j-davis(dot)com> | 
|---|---|
| To: | Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: Pre-proposal: unicode normalized text | 
| Date: | 2023-10-17 03:32:19 | 
| Message-ID: | c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com | 
| Lists: | pgsql-hackers | 
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> We need to be careful about precise terminology.  "Valid" has a
> defined meaning for Unicode.  A byte sequence can be valid or not as
> UTF-8.  But a string containing unassigned code points is not
> not-"valid" as Unicode.
New patch attached; the function is now named "unicode_assigned".
I believe the patch has utility as-is, but I've been brainstorming a
few more ideas that could build on it:
* Add a per-database option to enforce storing only assigned Unicode
code points.
* (More radical) Add a per-database option to normalize all text to
NFC.
* Do character classification in Unicode rather than relying on
glibc/ICU. This would affect regex character classes, etc., but would
not affect upper/lower/initcap or collation. I did some experiments
and the General Category doesn't change a lot: a total of 197
characters changed their General Category since Unicode 6.0.0, and
only 5 since ICU 11.0.0. I'm not quite sure how to expose this, but it
seems like a nicer way to handle it than tying it into the collation
provider.
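To make the three ideas concrete, here is a minimal sketch in Python using the standard unicodedata module. It is not the patch's implementation: the helper names is_assigned, is_nfc and general_category are hypothetical, and the results depend on the Unicode version bundled with the Python build rather than on PostgreSQL's tables. "Assigned" is taken here to mean that the General Category is anything other than Cn.

```python
import unicodedata


def is_assigned(text: str) -> bool:
    """True if no code point in text has General Category Cn (unassigned).

    Hypothetical helper; the answer depends on the Unicode version that
    this Python build ships (see unicodedata.unidata_version).
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


def is_nfc(text: str) -> bool:
    """True if text is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", text)


def general_category(ch: str) -> str:
    """Two-letter General Category of a single code point, e.g. 'Lu'."""
    return unicodedata.category(ch)


if __name__ == "__main__":
    decomposed = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)

    print(is_assigned("abc"))       # all code points assigned
    print(is_assigned("a\u0378"))   # U+0378 is unassigned
    print(is_nfc(decomposed), is_nfc(composed))
    print(general_category("A"), general_category("5"))
```

A per-database option along the lines above would presumably apply checks like these when text is stored, while classification by General Category would feed things like regex character classes.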
Regards,
	Jeff Davis
| Attachment | Content-Type | Size | 
|---|---|---|
| v3-0001-Additional-unicode-primitive-functions.patch | text/x-patch | 214.1 KB | 