Initcap works differently with different locale providers

From:	Oleg Tselebrovskiy <o(dot)tselebrovskiy(at)postgrespro(dot)ru>
To:	pgsql-docs(at)lists(dot)postgresql(dot)org
Subject:	Initcap works differently with different locale providers
Date:	2024-09-25 15:13:24
Message-ID:	804cc10ef95d4d3b298e76b181fd9437@postgrespro.ru
Lists:	pgsql-docs

Greetings, everyone!

One of our clients has found that the initcap function behaves
differently depending on the locale provider, as shown below:

postgres=# create database test_db_1 locale_provider=icu locale="ru_RU.UTF-8" template=template0;
NOTICE: using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
CREATE DATABASE
postgres=# \c test_db_1;
You are now connected to database "test_db_1" as user "postgres".
test_db_1=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.ю.
(1 row)
test_db_1=# select initcap('joHn d.e.');
initcap
-----------
John D.e.
(1 row)
postgres=# create database test_db_2 locale_provider=libc locale="ru_RU.UTF-8" template=template0;
CREATE DATABASE
postgres=# \c test_db_2
You are now connected to database "test_db_2" as user "postgres".
test_db_2=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.Ю.
(1 row)
test_db_2=# select initcap('joHn d.e.');
initcap
-----------
John D.E.
(1 row)

Here is an easier reproduction using explicit collations (should work
on REL_12_STABLE and up):

postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
initcap
--------------
First.second
(1 row)
postgres=# SELECT initcap('first.second' COLLATE "en_US");
initcap
--------------
First.Second
(1 row)

This behaviour is reproducible on every branch from REL_12_STABLE up to
master.

I don't believe this behaviour is erroneous, only different, so I am
proposing just a documentation change.

I suggest adding a clarification that this function works differently
with the libc and ICU providers, because the two define a "word"
differently.

In libc, a word is a sequence of alphanumeric characters separated by
non-alphanumeric characters (as the documentation currently states).
In ICU, words are divided according to Unicode® Standard Annex #29 [1].
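
To illustrate the two definitions of a "word", here is a minimal Python
sketch. It is not PostgreSQL's or ICU's actual code: the names
initcap_alnum, initcap_uax29_like and MID_CHARS are hypothetical, and the
second function models only one part of UAX #29, namely that a full stop
or apostrophe between two letters does not end a word. Everything else
in the annex is ignored.

```python
def initcap_alnum(text: str) -> str:
    """libc-style rule: a word is a maximal run of alphanumeric characters."""
    out = []
    in_word = False
    for ch in text:
        if ch.isalnum():
            # First character of a word is upper-cased, the rest lower-cased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


# Assumption: a tiny stand-in for the UAX #29 "mid-word" characters.
MID_CHARS = {".", "'"}


def initcap_uax29_like(text: str) -> str:
    """Simplified ICU-style rule: '.' or "'" between two letters stays in the word."""
    out = []
    in_word = False
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        elif (
            ch in MID_CHARS
            and in_word
            and 0 < i < last
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        ):
            # Letter, mid-word character, letter: no word boundary here.
            out.append(ch)
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


if __name__ == "__main__":
    for sample in ("first.second", "joHn d.e.", "ЧиЮ А.Ю."):
        print(initcap_alnum(sample), "|", initcap_uax29_like(sample))
```

Under these assumptions, the first function capitalizes the letter after
every full stop, while the second treats "first.second" and "d.e" as
single words, which agrees with the libc and ICU results in the sessions
above.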

A similar issue was briefly discussed in [2].

The suggested documentation patch is attached (one version for
REL_13_STABLE and later, and one for REL_12_STABLE only).

[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
[2]: https://www.postgresql.org/message-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA%40mail.gmail.com

Oleg Tselebrovskiy, Postgres Professional

Attachment	Content-Type	Size
v1-0001-string-functions.patch	text/x-diff	952 bytes
v1-0002-string-functions-REL_12.patch	text/x-diff	931 bytes
