From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Jeff Davis" <pgsql(at)j-davis(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: pg_collation.collversion for C.UTF-8 |
Date: | 2023-06-07 14:08:05 |
Message-ID: | f3986b9e-0588-48ae-bbf6-26da9c2cbfbd@manitou-mail.org |
Lists: | pgsql-hackers |
Jeff Davis wrote:
> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:
>
> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).
>
> b. Consistently interpret it as en-US-u-va-posix.
>
> c. Don't pass it to the provider at all and treat it with memcmp
> semantics.
I think b) and c) are quite problematic.
First, en-US-u-va-posix does not sort like C.UTF-8 in glibc.
For one thing, en-US-u-va-posix seems to assign zero weights to
some codepoints, which makes its semantics clearly different.
For instance consider ZERO WIDTH SPACE (U+200B):
postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8";
?column?
----------
t
postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu";
?column?
----------
f
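As a rough illustration outside PostgreSQL, the two results above can be mimicked in Python. This is only a sketch: it assumes that "zero weight" can be approximated by dropping U+200B before comparing, which is not how ICU actually implements collation, and the helper name strip_ignorable is made up for the example.

```python
# Illustration only: compare strings the way a codepoint-order collation would,
# then mimic a collation that gives U+200B a zero weight by dropping it.
ZWSP = "\u200b"  # ZERO WIDTH SPACE

left = "ab"
right = "a" + ZWSP + "a"

# Python compares str values codepoint by codepoint.
codepoint_order = left < right

# Comparing the UTF-8 encodings bytewise (memcmp-style) gives the same answer.
byte_order = left.encode("utf-8") < right.encode("utf-8")


def strip_ignorable(s: str) -> str:
    """Hypothetical stand-in for 'zero weight': ignore the character entirely."""
    return s.replace(ZWSP, "")


# With U+200B ignored, the comparison becomes 'ab' < 'aa'.
ignoring_zwsp = strip_ignorable(left) < strip_ignorable(right)

print(codepoint_order, byte_order, ignoring_zwsp)
```

In codepoint order 'b' (U+0062) sorts before U+200B, so the first two comparisons are true; once U+200B is ignored, 'ab' no longer sorts before 'aa'.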
Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]),
for our purpose, either it sorts by codepoints or it does not,
and it clearly does not. One consequence is that
en-US-u-va-posix-x-icu needs to be versioned and indexes
depending on it need to be rebuilt on upgrades.
OTOH the goal with C.UTF-8, which is achieved in glibc>=2.35,
is to not need that.
Also it's not just about sorting. The semantics for the ctype-kind
functions are also different.
Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we only match
ASCII characters 0-9, or 10 codepoints.
With "en-US-u-va-posix-x-icu" we match 660 codepoints comprising
all the digit characters in all languages, plus a bunch of variants
for mathematical symbols.
For instance consider U+FF10 (Fullwidth Digit Zero):
postgres=# select E'\uff10' collate "C.utf8" ~ '\d';
?column?
----------
f
postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d';
?column?
----------
t
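The same kind of split between ASCII-only and Unicode-aware digit classes can be reproduced outside PostgreSQL. The sketch below uses Python's re module purely as an analogy: by default \d matches any Unicode decimal digit for str patterns, while the re.ASCII flag restricts it to 0-9. It says nothing about how glibc or ICU classify characters internally.

```python
import re

fullwidth_zero = "\uff10"  # FULLWIDTH DIGIT ZERO

# ASCII-only digit class: comparable to what the C.utf8 query above returns.
ascii_only = re.fullmatch(r"\d", fullwidth_zero, flags=re.ASCII)

# Unicode-aware digit class: comparable to the ICU collation's result.
unicode_aware = re.fullmatch(r"\d", fullwidth_zero)

print(ascii_only is not None, unicode_aware is not None)
```

A CHECK constraint or application regexp written against one of these behaviors can accept or reject different rows under the other.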
If someone dumps their C.UTF-8 database to reload into an
ICU/en-US-u-va-posix database, there is no guarantee that it
even reloads, because of semantic differences occurring
in constraints. In most cases it will reload, but the apps
might not behave the same with the new database,
in a way that might be problematic.
It's fine if that's what they want and they explicitly ask for this
conversion, but it's not fine if it's postgres that has quietly
decided that for them.
About c) "don't pass it to the provider", it would be doable for
sorting (ignoring the "glibc before 2.35 does not sort like that" issue)
but not for the ctype-kind functions, where postgres' own code
doesn't have the Unicode knowledge.
About a) "just pass it to the provider", that seems better than b) or
c), but still, when a user asks for provider=icu locale=C.UTF-8,
it's very probably a pilot error.
To me the user would be best served by a warning, if not an error,
informing them that it's quite probably not the combination they want.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite