From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Pre-proposal: unicode normalized text |
Date: | 2024-03-01 01:02:51 |
Message-ID: | a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com |
Lists: | pgsql-hackers |
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.
Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.
Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.
But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)
I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.
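As a rough illustration of what such a per-field check tests, here is a Python sketch (not the patch's code). It assumes "assigned" means any general category other than Cn, and it relies on the Unicode version bundled with Python's unicodedata module, which need not match the one Postgres uses. The name all_assigned is hypothetical.

```python
import unicodedata


def all_assigned(text: str) -> bool:
    """Return True if every code point in text is assigned.

    Assumption: "unassigned" means general category Cn, as judged by
    the Unicode version bundled with this Python build.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in text)


if __name__ == "__main__":
    print(all_assigned("héllo"))    # ordinary assigned letters
    print(all_assigned("a\uffff"))  # U+FFFF is a noncharacter (category Cn)
```

An empty string passes trivially, since there is no code point to reject.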
One wrinkle is that the set of assigned code points as Postgres sees it
may not match what a collation provider sees, because the two can be built
against different Unicode versions. That's not great -- perhaps we could check that code
points are considered assigned by *both* Postgres and ICU. I don't know
if there's a way to tell if libc considers a code point to be assigned.
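The "assigned by both" idea can be sketched in Python by consulting two Unicode database versions. This is only an analogy: the current unicodedata tables and the frozen unicodedata.ucd_3_2_0 tables stand in for Postgres and ICU, no real ICU or Postgres call is made, and the function names are hypothetical.

```python
import unicodedata


def assigned_in(db, text: str) -> bool:
    """True if db (any object with a category() method) sees no Cn code point."""
    return all(db.category(ch) != "Cn" for ch in text)


def assigned_in_both(text: str) -> bool:
    """Accept text only if both Unicode database versions consider it assigned."""
    newer = unicodedata            # tables bundled with this Python build
    older = unicodedata.ucd_3_2_0  # frozen Unicode 3.2.0 tables
    return assigned_in(newer, text) and assigned_in(older, text)


if __name__ == "__main__":
    emoji = "\U0001F600"  # assigned well after Unicode 3.2.0
    print(assigned_in(unicodedata, emoji))
    print(assigned_in_both(emoji))
```

Requiring agreement means the older of the two versions sets the limit, so text can be rejected even though one side already recognizes it.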
Regards,
Jeff Davis
[1] https://www.unicode.org/policies/stability_policy.html#Normalization
Attachment | Content-Type | Size |
---|---|---|
v1-0001-CREATE-DATABASE-.-STRICT_UNICODE.patch | text/x-patch | 31.4 KB |