From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-03 19:54:46
Message-ID: 3941663a8e2f185d6acbbbc4f172c41dd3cfb6fe.camel@j-davis.com
Lists: pgsql-hackers
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.
Yeah, because we lose forward-compatibility of some useful operations.
> Here, Jeff mentions normalization, but I think it's a major issue
> with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.
The collation version number may reflect changes in which code points are
assigned, insofar as those changes affect collation -- though I'd like
to understand whether that's actually guaranteed.
Regardless, given that (a) we don't have a good story for migrating to
new collation versions, and (b) it would be painful to rebuild indexes
even if we did, you are right that it's a problem.
> Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet
> known to the collation library? To do so would be to tighten things
> up considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.
What problems exist today due to the rigidity of text?
I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.
> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.
At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.
I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.
One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.
Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.
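To make the idea concrete (not a patch, just a rough illustration): here is
what such a check looks like in Python, using the stdlib unicodedata module,
whose tables track whatever Unicode version that Python build bundles rather
than the version PostgreSQL generates its tables from. The name
contains_unassigned is the hypothetical one from above; in Unicode terms,
unassigned code points carry general category "Cn" (which also covers
noncharacters).

```python
import unicodedata

def contains_unassigned(s: str) -> bool:
    # General category 'Cn' means the code point is not assigned
    # in the Unicode version this Python build ships with.
    # (Noncharacters like U+FFFF are also 'Cn'.)
    return any(unicodedata.category(ch) == 'Cn' for ch in s)

print(contains_unassigned('hello'))   # ordinary assigned letters
print(contains_unassigned('\u0378'))  # U+0378 is unassigned
```

A core implementation would instead consult the tables PostgreSQL already
generates from a pinned Unicode version, which is exactly why it belongs in
core rather than relying on whatever the platform's libraries happen to know.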
> But at the same time I can certainly understand why Jeff finds
> the status quo problematic.
Yeah, I am looking for a better compromise between:
* everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
* everything is constantly changing, indexes break, and text
comparisons are slow
A stable idea of unicode normalization based on using only assigned
code points is very tempting.
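The first bullet is easy to reproduce; a sketch in Python's unicodedata
(again standing in for PostgreSQL's own Unicode tables): the precomposed and
decomposed spellings of 'á' are different byte sequences, so a memcmp-style
comparison says they differ, while NFC normalization makes them equal.

```python
import unicodedata

composed = '\u00e1'     # 'á' as one precomposed code point
decomposed = 'a\u0301'  # 'a' + COMBINING ACUTE ACCENT

# Binary (memcmp-style) comparison: not equal.
print(composed == decomposed)

# After NFC normalization: equal.
print(unicodedata.normalize('NFC', composed) ==
      unicodedata.normalize('NFC', decomposed))
```

Restricting input to assigned code points is what makes this stable across
Unicode versions: normalization of already-assigned code points is guaranteed
not to change, but a code point that is unassigned today could gain a
decomposition tomorrow.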
Regards,
Jeff Davis