Re: pg_collation.collversion for C.UTF-8

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_collation.collversion for C.UTF-8
Date: 2023-06-05 17:43:26
Message-ID: 5ad8d2f8-c11f-46d6-aab5-ed529d8e958a@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeff Davis wrote:

> > For libc: this change may affect any user who happened to have
> > LANG=C.UTF-8 in their environment at initdb time, which is probably a
> > lot of users, and some buildfarm members. However, the average risk
> > seems to be much lower, because we've gone a long time with the
> > assumption that C.UTF-8 has the same behavior as C, and this only
> > recently came up.

Currently, neither lc_collate_is_c() nor lookup_collation_cache()
think that C.UTF-8 is a C collation, since they do that kind of test:

if (strcmp(localeptr, "C") == 0)
result = true;
else if (strcmp(localeptr, "POSIX") == 0)
result = true;
else
result = false;

What is relatively new (v15) is that we compute a version for libc
collations in get_collation_actual_version(), with code that assumes
that C.* does not need a version, implying that it's immune to
Unicode changes. What came up in this thread is that this assumption
is not true for at least one major platform: Debian/Ubuntu for
releases occurring before 2022 (glibc < 2.35).

> We can avoid this risk by converting C.anything or POSIX.anything to
> plain "C" or "POSIX", respectively, for new collations before storing
> the string in the catalog. For upgraded collations, we can preserve the
> existing locale name. When opening the locale, we would still only
> recognize plain "C" and "POSIX" as the C locale.

Then Postgres would not sort the same as the operating system with the
same locale, at least on some OS. Concerning glibc, after waiting a
few years, glibc<2.35 will be obsolete, and C.UTF-8 sorting like C
will happen by itself.
But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-06-05 17:54:58 Re: Order changes in PG16 since ICU introduction
Previous Message Heikki Linnakangas 2023-06-05 17:43:17 Re: Let's make PostgreSQL multi-threaded