Re: Update Unicode data to Unicode 16.0.0

From:	Joe Conway <mail(at)joeconway(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2024-11-11 19:52:17
Message-ID:	f0bd0304-97b8-4a55-bf16-d1a7feb948e3@joeconway.com
Lists:	pgsql-hackers

On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
>
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0] AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
>
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch changes the result of
upper-casing some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
LOCALE_PROVIDER builtin
BUILTIN_LOCALE 'C.UTF-8'
TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------

8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------

8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So both UPPER and INITCAP produce different results with the patch
applied (LOWER is unchanged), unless I am missing something.
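
The checksums above only show that something changed, not which strings. As a minimal sketch of the same technique outside the database, the Python below hashes the sorted, concatenated results of a case-mapping function, and lists the inputs on which two mappings disagree. The names `checksum`, `changed_rows`, `old_upper` and `new_upper` are hypothetical, the two mapping functions are toy stand-ins rather than PostgreSQL's Unicode 15.1/16.0 tables, and sorting by code point is an assumption about how `ORDER BY 1` behaves in this database.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple

CaseFn = Callable[[str], str]


def checksum(strings: Iterable[str], fn: CaseFn) -> str:
    """md5 over the mapped strings, sorted by code point and concatenated.

    Mirrors the shape of md5(string_agg(fn(strings), NULL)) with ORDER BY 1,
    assuming code point ordering and UTF-8 encoding.
    """
    mapped = sorted(fn(s) for s in strings)
    return hashlib.md5("".join(mapped).encode("utf-8")).hexdigest()


def changed_rows(strings: Iterable[str], old_fn: CaseFn,
                 new_fn: CaseFn) -> List[Tuple[str, str, str]]:
    """Return (input, old result, new result) where the two mappings disagree."""
    return [(s, old_fn(s), new_fn(s))
            for s in strings
            if old_fn(s) != new_fn(s)]


# Toy stand-ins for "master" and "master+patch"; NOT the real Unicode tables.
def old_upper(s: str) -> str:
    return s.upper()


def new_upper(s: str) -> str:
    # Pretend the newer tables map one character differently.
    return s.upper().replace("Z", "z")


if __name__ == "__main__":
    data = ["abc", "xyz"]
    print(checksum(data, old_upper) == checksum(data, new_upper))
    for row in changed_rows(data, old_upper, new_upper):
        print(row)
```

Against real servers, the equivalent would be to dump the per-row results from each build and diff them, rather than comparing a single aggregate hash.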

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Update Unicode data to Unicode 16.0.0 at 2024-11-11 06:27:53 from Peter Eisentraut

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2024-11-12 09:40:52 from Laurenz Albe
