From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Pre-proposal: unicode normalized text |
Date: | 2023-09-12 22:47:10 |
Message-ID: | f30b58657ceb71d5be032decf4058d454cc1df74.camel@j-davis.com |
Lists: | pgsql-hackers |
One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:
SELECT 'á' = 'á'; -- false
because those are the Unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:
SELECT normalize('á') = normalize('á'); -- true
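The same effect can be reproduced outside the database. The following is a minimal sketch in Python, used here only for illustration: the standard unicodedata module stands in for normalize(), and NFC is assumed because it is the default form for normalize() in PostgreSQL.

```python
import unicodedata

decomposed = "\u0061\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"        # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Bytewise (memcmp-style) comparison of the UTF-8 encodings: the bytes differ.
raw_equal = decomposed.encode("utf-8") == precomposed.encode("utf-8")

# After NFC normalization both strings are the single code point U+00E1.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))

print(raw_equal, nfc_equal)   # prints: False True
```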
The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().
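As a rough sketch of that idea (not a proposed implementation), the Python class below normalizes once on input and then compares the stored bytes directly. The class name UText, the choice of NFC, and the UTF-8 storage are all assumptions made for illustration.

```python
import unicodedata


class UText:
    """Hypothetical stand-in for the proposed UTEXT type (illustration only)."""

    def __init__(self, value: str):
        # Normalize once, at input time; NFC is an assumption of this sketch.
        self.data = unicodedata.normalize("NFC", value).encode("utf-8")

    def __eq__(self, other):
        if not isinstance(other, UText):
            return NotImplemented
        # Plain byte comparison, the analogue of memcmp() returning zero.
        return self.data == other.data

    def __hash__(self):
        return hash(self.data)


# The two representations of the same character compare equal once stored.
same = UText("\u0061\u0301") == UText("\u00e1")
```

Because normalization happens only on input, equality stays a plain byte comparison and the stored value may differ from the bytes originally entered.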
Unicode guarantees that "the results of normalizing a string on one
version will always be the same as normalizing it on any other version,
as long as the string contains only assigned characters according to
both versions"[1]. It also guarantees that it "will not reallocate,
remove, or reassign" characters[2]. That means that we can normalize in
a forward-compatible way as long as we don't allow the use of
unassigned code points.
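A minimal sketch of such an input check, again in Python for illustration only: the function names are hypothetical, and "unassigned" is approximated by general category Cn in whatever Unicode version the local unicodedata module ships (see unicodedata.unidata_version). Cn also covers noncharacters, and how UTEXT should treat those is left open here.

```python
import unicodedata


def has_unassigned(value: str) -> bool:
    """Return True if any code point has general category Cn
    (unassigned or noncharacter) in the bundled Unicode version."""
    return any(unicodedata.category(ch) == "Cn" for ch in value)


def utext_in(value: str) -> str:
    """Hypothetical input function: reject unassigned code points,
    then normalize (NFC is an assumption of this sketch)."""
    if has_unassigned(value):
        raise ValueError("input contains an unassigned code point")
    return unicodedata.normalize("NFC", value)


# Accepted: only assigned code points, stored in normalized form.
ok = utext_in("\u0061\u0301")

# Rejected: U+40000 lies in a plane with no assigned characters.
try:
    utext_in("\U00040000")
    rejected = False
except ValueError:
    rejected = True
```

Rejecting the input rather than storing it unnormalized is what keeps the stored form stable: every accepted string contains only assigned characters, which is the condition under which the normalization guarantee applies.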
I looked at the SQL standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point
is not an error. Without a data type to enforce the constraint that
there are no unassigned code points, we can't guarantee forward
compatibility. Some other systems support NVARCHAR, but I didn't see
any guarantee of normalization or blocking unassigned code points
there, either.
UTEXT benefits:
* slightly better natural language semantics than TEXT with
deterministic collation
* still deterministic=true
* fast memcmp()-based comparisons
* no breaking semantic changes as Unicode evolves
TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.
UTEXT could be built-in or it could be an extension or in contrib. If
an extension, we'd probably want to at least expose a function that can
detect unassigned code points, so that it's easy to be consistent with
the auto-generated Unicode tables. I also notice that there already is
an unassigned code points table in saslprep.c, but it seems to be
frozen as of Unicode 3.2, and I'm not sure why.
Questions:
* Would this be useful enough to justify a new data type? Would it be
unclear when to choose one versus the other?
* Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?
* Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?
* Other comments or am I missing something?
Regards,
Jeff Davis
[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html