Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeremy Schneider <schneider(at)ardentperf(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Update Unicode data to Unicode 16.0.0
Date: 2025-03-17 23:15:21
Message-ID: 4edd59b81a56574fbd18ffa88b12f540fb6713fc.camel@j-davis.com
Lists: pgsql-hackers

On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
> In fact, on the analogy of timezones, I think we should not only
> adopt newly-published Unicode versions pretty quickly but push
> them into released branches as well.

That approach suggests that we consider something like my previous
STRICT_UNICODE proposal[1]. If Postgres updates Unicode quickly enough,
there's not much reason that users would need to use unassigned code
points, so it would be practical to just reject them (as an option).
That would dramatically reduce the practical problems people would
encounter when we do update Unicode.
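
To make the idea concrete, here is a minimal sketch of that kind of check, written in Python outside the database. It is not the STRICT_UNICODE patch and does not use any Postgres API; the function names are hypothetical, and "unassigned" is approximated as general category Cn (which also covers noncharacters) according to whatever Unicode version the local Python build ships.

```python
import unicodedata
from typing import Optional


def first_unassigned(text: str) -> Optional[int]:
    """Return the index of the first code point with general category Cn
    (unassigned or noncharacter) in the local Unicode data, else None."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            return i
    return None


def check_strict(text: str) -> str:
    """Hypothetical strict-mode check: reject text that contains an
    unassigned code point, otherwise return the text unchanged."""
    i = first_unassigned(text)
    if i is not None:
        raise ValueError(
            f"unassigned code point U+{ord(text[i]):04X} at index {i}"
        )
    return text


if __name__ == "__main__":
    # Which Unicode version this Python build knows about.
    print("Unicode version:", unicodedata.unidata_version)
    # U+0363 is assigned, so it passes.
    print(repr(check_strict("a\u0363")))
    # U+0378 is a reserved (unassigned) code point, so it is rejected.
    try:
        check_strict("a\u0378")
    except ValueError as err:
        print("rejected:", err)
```

The result depends on the Unicode version the checker was built with, which is the point above: the sooner that data is updated, the fewer legitimate strings such a check would turn away.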

Note that assigned code points can still change behavior in later
versions, but not in ways that would typically cause a problem for
things like indexes. For instance, U+0363 changed from non-Alphabetic
to Alphabetic in Unicode 16, which changes the results of the
expression:

U&'\0363' ~ '[[:alpha:]]' COLLATE PG_C_UTF8

from false to true, even though U+0363 is assigned in both Unicode
15.1.0 and 16.0.0. That might plausibly matter, but such cases would be
more obscure than case folding.

Regards,
Jeff Davis

[1] https://commitfest.postgresql.org/patch/4876/
