From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Jeff Davis" <pgsql(at)j-davis(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Built-in CTYPE provider |
Date: | 2023-12-20 12:49:20 |
Message-ID: | dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.
+1
> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.
Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres, that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.
But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.
ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibity,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From | Date | Subject | |
---|---|---|---|
Next Message | Pavel Borisov | 2023-12-20 12:51:23 | Re: Table AM Interface Enhancements |
Previous Message | Zhijie Hou (Fujitsu) | 2023-12-20 12:43:55 | RE: Synchronizing slots from primary to standby |