Re: ICU integration

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Doug Doole <ddoole(at)salesforce(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>
Subject: Re: ICU integration
Date: 2016-09-09 03:08:22
Message-ID: CAM3SWZQVv3s70tJ6WCmbcO8cVQjnj8ZruVMBNOqc1YpGmq7hFQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can
probably be used in a way that clearly decouples these two issues,
which is very important. The first is that the rules of collations
change. The second is that the binary key that collators create (i.e.
the equivalent of strxfrm()) can change for various reasons that have
nothing to do with culture or natural languages -- purely technical
reasons. For example, they can add new optimizations to make
generating new binary keys faster. If there are bugs in how that
works, they can fix the bugs and increment the identifier [1], which
could allow Postgres to insist on a REINDEX (if abbreviated keys for
collated text were reenabled, although I don't think that problems
like that are limited to binary key generation).
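
To make the separation of the two issues concrete, here is a minimal
sketch, not anything Postgres or ICU actually does. It assumes each
collation records two independent identifiers; the names
CollationVersions, rules_version, sortkey_version and required_action
are all hypothetical, as is the uses_binary_keys flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollationVersions:
    # Hypothetical: identifies the ordering rules (the "cultural" part).
    rules_version: str
    # Hypothetical: identifies the binary key format (the "technical" part).
    sortkey_version: str


def required_action(recorded: CollationVersions,
                    current: CollationVersions,
                    uses_binary_keys: bool) -> str:
    """Decide what to do when the recorded versions may differ from the
    versions reported by the currently installed library."""
    # A rules change can alter the sort order itself.
    if recorded.rules_version != current.rules_version:
        return "reindex"
    # A key-format change only matters if binary keys are relied upon.
    if uses_binary_keys and recorded.sortkey_version != current.sortkey_version:
        return "reindex"
    return "none"


if __name__ == "__main__":
    old = CollationVersions("rules-1", "keys-1")
    new = CollationVersions("rules-1", "keys-2")
    print(required_action(old, new, uses_binary_keys=False))
    print(required_action(old, new, uses_binary_keys=True))
```

The point of keeping two fields is only that a purely technical change
need not be treated the same way as a change to the collation rules.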

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia) | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa) | 99-38-00-00 | 07-00-00-00
Aghem | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon) | 99-38-00-00 | 07-00-00-00
Akan | 99-38-00-00 | 07-00-00-00
Akan (Ghana) | 99-38-00-00 | 07-00-00-00
Amharic | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in,
although this is undocumented (it also has any "custom tailorings"
that might be used, say if we had user defined customizations to
collations, as Firebird apparently does [2] [3]). I'm pretty sure that
UCA version relates to a version of the Unicode collation algorithm,
and its associated DUCET table (this is all subject to ISO
standardization). I gather that a particular collation is actually
comprised of a base UCA version (and DUCET table -- I think that ICU
sometimes calls this the "root"), with custom tailorings that a locale
provides for a given culture or country. These collators may in turn
be further "tailored" to get that fancy user defined customization
stuff.
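
As a rough illustration of how the output above could be consumed, the
sketch below parses rows of that table and groups collators by UCA
version. The names CollatorRow, parse_row and group_by_uca are
hypothetical, and the version fields are deliberately kept as opaque
strings, since the contents of the "ICU version" field are
undocumented.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CollatorRow(NamedTuple):
    name: str
    icu_version: Tuple[str, ...]   # opaque components, not interpreted
    uca_version: Tuple[str, ...]   # opaque components, not interpreted


def parse_row(line: str) -> CollatorRow:
    """Parse one 'name | a-b-c-d | a-b-c-d' row of the table."""
    name, icu, uca = (field.strip() for field in line.split("|"))
    return CollatorRow(name, tuple(icu.split("-")), tuple(uca.split("-")))


def group_by_uca(rows: Iterable[CollatorRow]) -> Dict[Tuple[str, ...], List[str]]:
    """Map each UCA version to the collator names that report it."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for row in rows:
        groups.setdefault(row.uca_version, []).append(row.name)
    return groups


if __name__ == "__main__":
    lines = [
        "Afrikaans | 99-38-00-00 | 07-00-00-00",
        "Aghem (Cameroon) | 99-38-00-00 | 07-00-00-00",
    ]
    print(group_by_uca(parse_row(line) for line in lines))
```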

In principle, and assuming I haven't gotten something wrong, it ought
to be possible to unambiguously identify a collation based on a
matching UCA version (i.e. DUCET table), plus the collation tailorings
matching exactly, even across ICU versions that happen to be based on
the same UCA version (they only seem to put out a new UCA version
about once a year [4]). It *might* be fine, practically speaking, to
assume that a collation with a matching iso-code and UCA version is
compatible forever and always across any ICU version. If not, it might
instead be feasible to write a custom fingerprinter for collation
tailorings that ran at initdb time. Maybe the tailorings, which are
abstract rules, could even be stored in system catalogs, so that the
only thing that needs to match is ICU's UCA version (the "root"
collators must still match). Replicas could then reconstruct the
serialized tailorings that comprise a collation as needed [5][6]; the
tailoring that a default collator for a locale uses isn't special,
technically speaking.
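
The fingerprinting idea can be sketched as follows. This is only an
illustration under stated assumptions: it hashes the raw text of the
tailoring rules, which is stricter than necessary (two rule strings
that mean the same thing but differ in whitespace would not match).
The names CollationIdentity, fingerprint_tailoring, identify and
compatible are hypothetical, and the rule string in the example is
just a placeholder.

```python
import hashlib
from typing import NamedTuple


class CollationIdentity(NamedTuple):
    uca_version: str             # version of the "root" (UCA/DUCET)
    tailoring_fingerprint: str   # digest of the tailoring rule text


def fingerprint_tailoring(rules: str) -> str:
    """Return a SHA-256 hex digest of the tailoring rule text."""
    return hashlib.sha256(rules.encode("utf-8")).hexdigest()


def identify(uca_version: str, tailoring_rules: str) -> CollationIdentity:
    """Build an identity from the root version plus the tailoring."""
    return CollationIdentity(uca_version, fingerprint_tailoring(tailoring_rules))


def compatible(a: CollationIdentity, b: CollationIdentity) -> bool:
    """Treat two collations as interchangeable only if both parts match."""
    return a == b


if __name__ == "__main__":
    at_initdb = identify("07-00-00-00", "&c < ch")
    after_upgrade = identify("07-00-00-00", "&c < ch")
    print(compatible(at_initdb, after_upgrade))
```

Note that the ICU library version itself does not appear in the
identity; that is the whole point of the scheme, and also the part
that would need the most verification.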

Of course, this is all pretty hand-wavy right now, and much more
research is needed. I am very intrigued about the idea of storing the
collators in the system catalogs wholesale, since ICU provides
facilities that make that seem possible. If a "serialized unicode set"
built from a collator's tailoring rules, or, alternatively, a collator
saved as a binary representation [7] were stored in the system
catalogs, perhaps it wouldn't matter as much that the stuff
distributed with different ICU versions didn't match, at least in
theory. It's unclear that the system catalog representation could be
usable with a fair cross section of ICU versions, but if it could then
that would be perfect. This also seems to be how Firebird style
user-defined tailorings might be implemented anyway, and it seems very
appealing to add that as a light layer on top of how the base system
works, if at all possible.

[1] https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6] https://ssl.icu-project.org/apiref/icu4c/structUSerializedSet.html
[7] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a2719995a75ebed7aacc1419bb2b781db
--
Peter Geoghegan
