Re: Update Unicode data to Unicode 16.0.0

From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Update Unicode data to Unicode 16.0.0
Date: 2025-03-19 05:25:39
Message-ID: 20250318222539.5f1b5b2f@ardentperf.com
Lists: pgsql-hackers

On Tue, 18 Mar 2025 19:33:00 -0700
Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> If we compare the following two problems:
>
> A. With glibc or ICU, every text index, including primary keys, are
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
>
> B. With the builtin provider, only expression indexes and a few
> other things are vulnerable, only during a major version upgrade, and
> mostly (but not entirely) when using recently-assigned Cased letters.
>
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of
> the problem when it does happen, likelihood of a vulnerable object
> having an actual problem, etc. If you disagree, I'd like to hear more.

Jeff - you and several others have literally put years into making this
better, and it's deeply appreciated. I agree that with the builtin
provider we're in a much better place.

I don't quite understand Tom's argument about why Unicode 15 must
eventually become untenable. Why are we assuming it will? In Oracle's
entire history, I think they have only ever supported four versions of
Unicode. [1] MySQL seems to have added their second only recently. [2]
And again - we have ICU if I need the latest emoji characters. Frankly,
Unicode 15 is pretty good. Most updates to Unicode these days are fairly
minor.

Maybe Postgres can be the first database to always ship support for the
latest Unicode with each major version - but I think we should design
that right if we're going to do it. If we just stay on Unicode 15 for
now then there are no problems with case-insensitive indexes or range
partitioned tables returning wrong query results after a major version
upgrade.

There's been a lot of discussion about indexes, but this SQL also seems
to work:

postgres=# create table test_events(customer_name text, ts timestamp,
message text) partition by range((lower(customer_name)));

I'm sure that people shouldn't do this ... but if anyone /did/ then it
wouldn't be as simple as an index rebuild after their major version
upgrade.

I had never really considered it before, but this SQL also seems to work:

postgres=# create table test_events(id uuid, ts timestamp, message
text) partition by range((ts at time zone 'America/Sao_Paulo'));

I'm sure that people shouldn't do that either ... but if anyone did then
would their rows be in the wrong partition after they upgraded
from 11.4 to 11.5?

The difficulty here is that I work at a company with thousands of
developers and lots of Postgres, and I see people do things all the time
that we might think they "shouldn't" do.

Before we bump the Unicode version, personally I'd just like to have
some tools to make it so people actually can't do the things they
shouldn't do.
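
As a rough illustration of where such a check could start (a sketch only,
not an existing tool): the query below lists partitioned tables whose
partition key contains an expression, which would catch both examples
above. It assumes only the standard catalogs pg_partitioned_table and
pg_class and the pg_get_partkeydef() function. It flags candidates for
review; it does not decide whether a given expression is actually
sensitive to Unicode or time zone data changes.

```sql
-- Sketch: find partitioned tables whose partition key includes an
-- expression rather than only plain column references.
-- partexprs is NULL when every key column is a simple column reference.
SELECT c.oid::regclass          AS partitioned_table,
       pg_get_partkeydef(c.oid) AS partition_key
FROM pg_partitioned_table AS p
JOIN pg_class AS c ON c.oid = p.partrelid
WHERE p.partexprs IS NOT NULL
ORDER BY c.relname;
```

For the first test_events example, this would be expected to report a key
along the lines of RANGE (lower(customer_name)).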

-Jeremy

1:
https://docs.oracle.com/en/database/oracle/oracle-database/23/nlspg/appendix-A-locale-data.html#GUID-CC85A33C-81FC-4E93-BAAB-1B3DB9036060__CIABEDHB

2:
https://dev.mysql.com/blog-archive/mysql-character-sets-unicode-and-uca-compliant-collations/
