From: | Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: Collation version tracking for macOS |
Date: | 2022-06-10 00:48:33 |
Message-ID: | 30766D63-C51C-4000-B270-501635F58E43@gmx.net |
Lists: | pgsql-hackers |
Thanks for picking this up!
> How can I see evidence of this? I'm comparing Debian, FreeBSD and
> macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
> /usr/share/dict/words" I get upper and lower case mixed together on
> the other OSes, but on the Mac the upper case comes first, which is my
> usual smoke test for "am I looking at binary sort order?"
Perhaps I can shed some light on this matter:
Apple's libc collations have always been a bit special in this respect, even the non-UTF-8 ones. Rooted in ancient FreeBSD, they "try to keep collating table backward compatible with ASCII", so upper-case and lower-case characters are kept separate (there are exceptions, such as 'cs_CZ.ISO8859-2'). The latest public sources I can find are in adv_cmds-119 [1], which belongs to OS X 10.5 [2]. These correspond to the ones used in FreeBSD up to v10 [3], although the timestamps suggest they originate from around FreeBSD 5. Further, there are only very few locales with collation files of their own on macOS (36, none of them supporting Unicode), and these have not changed for a very long time (I verified that from OS X 10.6.8 to macOS 12.4 [4]; the one exception is 'de_DE-A.ISO8859-1', which is present only in macOS 10.15).
What Apple does instead is symlink [5] the missing collations to similar ones, even across encodings. This often results in la_LN.US-ASCII being used ('la_LN' seems to stand for a Latin meta-language), which is exactly byte order [6]. These symlinks did not change [7] from OS X 10.6.8 to macOS 10.15.7. In macOS 11, however, many of them changed their target. For example, the popular 'en_US.UTF-8' changed from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1', and 'de_DE.UTF-8' changed from 'la_LN.US-ASCII' to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 collations went from no collation support to partial/broken collation support. macOS 12 again shows no changes; I have not yet tested macOS 13.
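For anyone who wants to repeat the symlink comparison across OS versions, here is a minimal Python sketch of what [7] does, plus a diff of two snapshots. The function names `collation_symlinks` and `changed_targets` are made up for illustration, and the workflow of saving one snapshot per OS release is an assumption of mine, not something the tooling provides.

```python
import os


def collation_symlinks(root="/usr/share/locale"):
    """Map each locale name to the target of its LC_COLLATE symlink.

    Locales whose LC_COLLATE is a regular file (or missing) are skipped.
    """
    links = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name, "LC_COLLATE")
        if os.path.islink(path):
            # readlink returns the raw target string, without resolving it
            links[name] = os.readlink(path)
    return links


def changed_targets(before, after):
    """Return {locale: (old_target, new_target)} for locales present in
    both snapshots whose symlink target differs."""
    return {
        name: (before[name], after[name])
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    }
```

Running `collation_symlinks()` on two releases and passing both results to `changed_targets` should list the locales, such as 'en_US.UTF-8', whose target moved between them.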
# tl;dr
With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and "en_US.UTF-8", but "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" produces a different result from "( echo '5£'; echo '£5' ) | LC_COLLATE=C sort". Alternatively, test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort /usr/share/dict/words)".
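The same smoke test can be run from Python against libc's collation directly, which is closer to what the server would see than going through sort(1). This is only a sketch: `sort_with_locale` and `differs_from_c` are hypothetical helper names, and it assumes the locale under test is installed (otherwise `locale.setlocale` raises `locale.Error`).

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words using libc's collation for locale_name (LC_COLLATE only)."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not available on this system.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    finally:
        # Always restore the process-wide setting we started with.
        locale.setlocale(locale.LC_COLLATE, previous)


def differs_from_c(words, locale_name):
    """True if locale_name orders words differently from the "C" locale."""
    return sort_with_locale(words, locale_name) != sort_with_locale(words, "C")
```

For example, `differs_from_c(["5£", "£5"], "en_US.UTF-8")` is the equivalent of the two-line pipeline above. Note that `setlocale` changes process-wide state, so this is not safe to call from multiple threads.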
The upside is that we don't have to cope with the new characters added in every version of Unicode (although I have not examined LC_CTYPE yet).
best regards
Tobias
[1]: https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef
[2]: https://opensource.apple.com/releases/
[3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef
[4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \;
[5]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile
[6]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src
[7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \;