From: | Joe Conway <mail(at)joeconway(dot)com> |
---|---|
To: | Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Update Unicode data to Unicode 16.0.0 |
Date: | 2024-11-11 19:52:17 |
Message-ID: | f0bd0304-97b8-4a55-bf16-d1a7feb948e3@joeconway.com |
Lists: | pgsql-hackers |
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
>
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0] AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
>
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
I ran a check and found that this patch changes the results of upper()
and initcap() for some characters. Repro:
setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
LOCALE_PROVIDER builtin
BUILTIN_LOCALE 'C.UTF-8'
TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------
8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------
8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)
Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
3055b3d5dff76c8c1250ef500c6ec13f
(1 row)
Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------
So with the patch applied, LOWER is unchanged, but both UPPER and INITCAP
produce different results, unless I am missing something.
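
The md5 comparison only shows that something changed, not which characters. As a rough sketch of how one might narrow that down: dump two-column CSV (input string, case-mapped string) from each build, for example with `\copy (SELECT strings, upper(strings) FROM unsorted_table) TO 'upper-master.csv' (format csv)`, and compare the two dumps. The file name and the helper names below (`load_rows`, `codepoints`, `diff_case_mappings`) are hypothetical and not part of the patch or of PostgreSQL.

```python
import csv
import io


def load_rows(text):
    """Parse two-column CSV text into {input string: case-mapped string}."""
    reader = csv.reader(io.StringIO(text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def codepoints(s):
    """Render a string as space-separated code points, e.g. 'U+0041'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def diff_case_mappings(before, after):
    """Return sorted (input, old, new) tuples for inputs whose mapping differs.

    Inputs present in only one of the two dumps are ignored.
    """
    changed = []
    for original, old in before.items():
        new = after.get(original)
        if new is not None and new != old:
            changed.append((original, old, new))
    return sorted(changed)


def report(before_text, after_text):
    """Build one human-readable line per changed mapping."""
    changed = diff_case_mappings(load_rows(before_text), load_rows(after_text))
    return [
        f"{codepoints(src)}: {codepoints(old)} -> {codepoints(new)}"
        for src, old, new in changed
    ]
```

Reading both dump files as UTF-8 and passing their contents to `report()` would list each affected input with its old and new mapping as code points, which should be easier to check against the Unicode 16.0.0 data files than a checksum.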
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com