Re: Update Unicode data to Unicode 16.0.0

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-21 20:45:24
Message-ID:	d04688cb65619f3c006352763bc8285e1ce3537a.camel@j-davis.com
Lists:	pgsql-hackers

On Fri, 2025-03-21 at 10:45 -0400, Robert Haas wrote:
> We might need a way for ALTER DATABASE to allow the
> database default to be adjusted. I'm not quite sure here, but my
> general feeling is that Unicode version feels like part of the
> collation and that we should avoid introducing a separate mechanism
> if possible. What are your thoughts?

My (early stage) plans are to have two new shared catalogs,
pg_ctype_provider and pg_collation_provider. Objects would depend on
records in those shared catalogs, which would each have a version. We'd
eventually allow multiple records with providerkind=icu, for instance,
and have some way to choose which one to use (perhaps new objects get
the default version, old objects keep the old version, or something).

The reason to have two shared catalogs is that some objects depend
on collation behavior and some on ctype behavior. If there's an index
on "t COLLATE PG_C_UTF8" then there would be no direct dependency from
the index to the builtin provider in either catalog, because collation
behavior in the builtin provider is unversioned memcmp. But if there's
an index on "LOWER(t COLLATE PG_C_UTF8)", then it would have a
dependency entry to the builtin provider's entry in pg_ctype_provider.
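
As a rough illustration of that distinction (a Python sketch, not
PostgreSQL code; the function names are hypothetical), byte-wise
comparison of UTF-8 needs no Unicode tables at all, while lowercasing
consults case-mapping data that belongs to a specific Unicode version:

```python
import unicodedata


def memcmp_key(s: str) -> bytes:
    # Byte-wise comparison of the UTF-8 encoding, analogous to memcmp.
    # No Unicode tables are consulted, so there is nothing to version.
    return s.encode("utf-8")


def ctype_lower(s: str) -> str:
    # Case mapping is driven by Unicode data tables. Python's str.lower()
    # stands in here for a ctype provider (an assumption of this sketch).
    return s.lower()


words = ["b", "a", "\u00e9", "Z", "\U0001F600"]

# UTF-8 byte order is the same as code point order.
by_bytes = sorted(words, key=memcmp_key)
by_code_point = sorted(words)  # Python compares str by code point

# The Unicode version behind the case-mapping tables in this interpreter.
table_version = unicodedata.unidata_version
```

In this model an index ordered by memcmp_key never needs rebuilding
when the tables change, whereas anything that stores ctype_lower()
output depends on table_version.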

>
> I'm curious why you think this. My own feeling (as I think you
> probably know, but just to be clear) is that relatively few people
> need extremely precise control over their collation behavior, but
> there are some who do. However, I think there are many people for
> whom a code-point sort won't be good enough.

You can use ICU for sorting without using it for the index comparators.
Using ICU in the index comparators is an implementation detail that's
only required for unique indexes over non-deterministic collations. And
if it's not used for the index comparators, then most of the problems
go away, and versioning is not nearly so important.
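
A minimal sketch of that separation, in Python rather than PostgreSQL
internals: the "index" is kept in byte order and used only for
lookups, and the linguistic ordering is applied as a final sort.
Everything named here is hypothetical, and display_key is only a
casefold-based stand-in for a real ICU sort key:

```python
import bisect


def index_key(s: str) -> bytes:
    # Comparator used by the index: plain byte order, never changes.
    return s.encode("utf-8")


def display_key(s: str):
    # Stand-in for a linguistic (ICU-like) sort key. NOT real ICU.
    return (s.casefold(), s)


rows = ["banana", "Apple", "cherry", "apple", "Banana"]

# Build the index once, in byte order.
index = sorted(rows, key=index_key)
index_keys = [index_key(r) for r in index]


def lookup(value: str) -> bool:
    # Equality lookup via binary search over the byte-ordered index.
    k = index_key(value)
    i = bisect.bisect_left(index_keys, k)
    return i < len(index_keys) and index_keys[i] == k


def query_sorted(values):
    # ORDER BY-style final sort, applied at query time only.
    return sorted(values, key=display_key)
```

If display_key changes behavior between versions, only the output
order of query_sorted changes; the stored index stays valid.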

Sure, there are some cases where using ICU in the index comparator is
important, and I'm not suggesting that we remove functionality. But I
believe that using libc or ICU for index comparators is the wrong
default behavior -- high downsides and low upsides for most text
indexes that have ever been created.

Even if there is an ORDER BY, using an index is often the wrong thing
unless it's an index-only scan. Text indexes are rarely correlated with
the heap, so it would lead to a lot of random heap fetches, and it's
often better to just execute the query and do a final sort. The
situations where ICU in the comparator is a good idea are special cases
of special cases.

I've posted about this in the past, and got universal disagreement. But
I believe others will eventually come to the same conclusion that I
did.

>
> Maybe we should actually move in the direction of having encodings
> that are essentially specific versions of Unicode. Instead of just
> having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> whatever, which would only accept code points known to that version
> of Unicode. Or maybe this shouldn't be entirely new encodings but
> something vaguely akin to a typmod, so that you could have columns of
> type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> exclude unassigned code points, then we know they aren't there, and
> we can make deductions about what is safe to do based on that
> information.

I like this line of thinking, vaguely similar to my STRICT_UNICODE
database option proposal. Maybe these aren't exactly the right things
to do, but I think there are some possibilities here, and we shouldn't
give up and assume there's a problem when usually there is not.

It reminds me of fast-path locking: sure, there *might* be DDL
happening while I'm trying to do a simple SELECT query. But probably
not, so let's make it the responsibility of DDL to warn others that
it's doing something, rather than the responsibility of the SELECT
query.

Regards,
Jeff Davis

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-21 14:45:47 from Robert Haas

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-23 04:00:31 from Jeremy Schneider