From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-03 19:54:46
Message-ID: 3941663a8e2f185d6acbbbc4f172c41dd3cfb6fe.camel@j-davis.com
Lists: pgsql-hackers
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.
Yeah, because we lose forward-compatibility of some useful operations.
> Here, Jeff mentions normalization, but I think it's a major issue
> with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.
The collation version number may reflect changes in which code points are
assigned, insofar as those changes affect collation -- though I'd like
to understand whether that's actually guaranteed.
Regardless, given that (a) we don't have a good story for migrating to
new collation versions, and (b) it would be painful to rebuild indexes
even if we did, you are right that it's a problem.
> Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet
> known to the collation library? To do so would be to tighten things
> up considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.
What problems exist today due to the rigidity of text?
I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.
> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.
At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.
I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.
One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.
Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.
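To make the idea concrete (not a patch, just a rough illustration): here is
what such a check looks like in Python, using the stdlib unicodedata module,
whose tables track whatever Unicode version that Python build bundles rather
than the version PostgreSQL generates its tables from. The name
contains_unassigned is the hypothetical one from above; in Unicode terms,
unassigned code points carry general category "Cn" (which also covers
noncharacters).

```python
import unicodedata

def contains_unassigned(s: str) -> bool:
    # General category 'Cn' means the code point is not assigned
    # in the Unicode version this Python build ships with.
    # (Noncharacters like U+FFFF are also 'Cn'.)
    return any(unicodedata.category(ch) == 'Cn' for ch in s)

print(contains_unassigned('hello'))   # ordinary assigned letters
print(contains_unassigned('\u0378'))  # U+0378 is unassigned
```

A core implementation would instead consult the tables PostgreSQL already
generates from a pinned Unicode version, which is exactly why it belongs in
core rather than relying on whatever the platform's libraries happen to know.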
> But at the same time I can certainly understand why Jeff finds
> the status quo problematic.
Yeah, I am looking for a better compromise between:
* everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
* everything is constantly changing, indexes break, and text
comparisons are slow
A stable idea of unicode normalization based on using only assigned
code points is very tempting.
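The first bullet is easy to reproduce; a sketch in Python's unicodedata
(again standing in for PostgreSQL's own Unicode tables): the precomposed and
decomposed spellings of 'á' are different byte sequences, so a memcmp-style
comparison says they differ, while NFC normalization makes them equal.

```python
import unicodedata

composed = '\u00e1'     # 'á' as one precomposed code point
decomposed = 'a\u0301'  # 'a' + COMBINING ACUTE ACCENT

# Binary (memcmp-style) comparison: not equal.
print(composed == decomposed)

# After NFC normalization: equal.
print(unicodedata.normalize('NFC', composed) ==
      unicodedata.normalize('NFC', decomposed))
```

Restricting input to assigned code points is what makes this stable across
Unicode versions: normalization of already-assigned code points is guaranteed
not to change, but a code point that is unassigned today could gain a
decomposition tomorrow.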
Regards,
Jeff Davis