Re: Windows UTF-8, non-ICU collation trouble

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Windows UTF-8, non-ICU collation trouble
Date: 2019-12-10 02:41:15
Message-ID: CA+hUKGLRbkx1htdWpOzuqbqxADqmeY5ekWwojbq1LeLRQbZmQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 6, 2019 at 8:33 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote:
> > On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > We use system UTF-16 collation to implement UTF-8 collation on Windows. The
> > > PostgreSQL security team received a report, from Timothy Kuun, that this
> > > collation does not uphold the "symmetric law" and "transitive law" that we
> > > require for btree operator classes. The attached test program demonstrates
> > > this. http://www.delphigroups.info/2/62/478610.html quotes reports of that
> > > problem going back eighteen years. Most code points are unaffected. Indexing
> > > an affected code point using such a collation can cause btree index scans to not
> > > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> > > admit a duplicate. The security team determined that this doesn't qualify as a
> > > security vulnerability, but it's still a bug.
> >
> > Huh. Does this apply in modern times? Since Windows 10, I thought
> > they adopted[1] CLDR data to drive that, the same definitions used (or
> > somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
> > etc. Basically, everyone gave up on trying to own this rats nest of a
> > problem and deferred to the experts.
>
> Based on my test program, it applies to Windows Server 2016. I didn't test
> newer versions.

I ran a variation of your program on Appveyor's Studio/Server 2019
image, and the result was the same: it thinks that cmp(s1, s2) == 0,
cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be
transitive.

These strings include combining characters probably used in a weird
way[1][2], and they confuse my terminal software. Although they don't
all agree, the 3 other implementations I tried gave self-consistent
answers. All of the below expressions are true:

create table t as select e'\u11a7\u1188\ud7a2' s1,
e'\u11a7\ud7a2\u1188' s2, e'\ud7a2\u11a7\u1188' s3;
select s1 > s2 collate "ko-x-icu", s2 > s3 collate "ko-x-icu", s1 > s3
collate "ko-x-icu" from t; [ICU]
select s1 > s2 collate "ko_KR.UTF-8", s2 > s3 collate "ko_KR.UTF-8",
s1 > s3 collate "ko_KR.UTF-8" from t; [FreeBSD]
select s1 < s2 collate "ko_KR.utf8", s2 < s3 collate "ko_KR.utf8", s1
< s3 collate "ko_KR.utf8" from t; [GNU]

> > If you can still get
> > index-busting behaviour out of modern Windows collations, wouldn't
> > that be a bug that someone can file against SQL Server, Windows etc
> > and get fixed?
>
> Perhaps. I wouldn't have high hopes, given the behavior's long tenure and the
> risk of breaking a different set of applications.

I found a SQL Server test website[3] and tried to get it to do
something strange, using "Windows" collations (the ones that are
supposed to be compatible with CompareString() AKA strcoll(), much
like our "libc" provider). For Latin1_General_100_CI_AS_SC_UTF8 and
Korean_100_CS_AS it insisted that cmp(s1, s2) == 1, cmp(s2, s3) == 0,
cmp(s1, s3) == 1, while for Korean_90_CS_AS it said -1, 0, -1, all
self-consistent answers, matching neither your results nor the results
of other implementations. Perhaps my query is wrong, or perhaps I
don't understand how to find the right collation name to break
something, or perhaps it does some kind of normalisation or other
transformation on the strings. Clearly I failed to access the same
code that your sort-locale.c test reaches.

create table t (s1 nvarchar(10) collate Korean_100_CS_AS,
s2 nvarchar(10) collate Korean_100_CS_AS,
s3 nvarchar(10) collate Korean_100_CS_AS);

insert into t values (concat(nchar(0x11a7), nchar(0x1188), nchar(0xd7a2)),
concat(nchar(0x11a7), nchar(0xd7a2), nchar(0x1188)),
concat(nchar(0xd7a2), nchar(0x11a7), nchar(0x1188)));

select case when s1 < s2 then -1 when s1 > s2 then 1 else 0 end,
case when s2 < s3 then -1 when s2 > s3 then 1 else 0 end,
case when s1 < s3 then -1 when s1 > s3 then 1 else 0 end
from t;

[1] https://en.wikipedia.org/wiki/List_of_Hangul_jamo
[2] https://www.unicode.org/faq/korean.html
[3] https://sqltest.net/#927795

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2019-12-10 02:56:26 Re: Index corruption / planner issue with one table in my pg 11.6 instance
Previous Message Michael Paquier 2019-12-10 02:33:38 Re: Index corruption / planner issue with one table in my pg 11.6 instance