We use system UTF-16 collation to implement UTF-8 collation on Windows. The
PostgreSQL security team received a report, from Timothy Kuun, that this
collation does not uphold the "symmetric law" and "transitive law" that we
require for btree operator classes. The attached test program demonstrates
this. http://www.delphigroups.info/2/62/478610.html quotes reports of that
problem going back eighteen years. Most code points are unaffected. Indexing
an affected code point using such a collation can cause btree index scans to not
find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
admit a duplicate. The security team determined that this doesn't qualify as a
security vulnerability, but it's still a bug.
All I can think to do is issue a warning whenever a CREATE DATABASE or CREATE
COLLATION combines UTF8 encoding with a locale having this problem. In a
greenfield, I would forbid affected combinations of encoding and locale. That
is too harsh, considering the few code points affected and the difficulty of
changing the collation of existing databases. For CREATE DATABASE, all except
LOCALE=C would trigger the warning. For CREATE COLLATION, ICU locales would
also not trigger the warning. Hence, the chief workaround is to use LOCALE=C at
the database level and ICU collations for indexes and operator invocations.
(The ability to use an ICU collation at the database level would improve the
user experience here.) Better ideas?