From: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | strcmp() tie-breaker for identical ICU-collated strings |
Date: | 2017-06-01 18:58:51 |
Message-ID: | CAJ3gD9ez9O7scYT46iRXU-1KDfDeSQvJ2Ekzxs7RYGofYfB4cg@mail.gmail.com |
Lists: | pgsql-hackers |
While comparing two text strings in varstr_cmp(), if the strcoll*()
call returns 0, we apply a strcmp() tie-breaker to do a binary comparison,
because strcoll() can return 0 for non-identical strings:
varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal. Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}
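To make the tie-breaker pattern concrete, here is a minimal sketch. The names toy_collate() and compare_with_tiebreak() are hypothetical and are not PostgreSQL code. toy_collate() is only a stand-in for strcoll(): it ignores ASCII case so that it returns 0 for two non-identical strings, which is the situation the comment above describes.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for strcoll(): compares ASCII strings while
 * ignoring case, so "abc" and "ABC" are reported as equal (0).
 * No real locale is claimed to behave this way; it only mimics a
 * collation that returns 0 for non-identical strings.
 */
static int
toy_collate(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }

    /* The shorter string sorts first; same length means "equal". */
    if (*a)
        return 1;
    if (*b)
        return -1;
    return 0;
}

/*
 * Collation-aware comparison followed by the strcmp() tie-breaker,
 * mirroring the shape of the varstr_cmp() excerpt.
 */
static int
compare_with_tiebreak(const char *a, const char *b)
{
    int result = toy_collate(a, b);

    if (result == 0)
        result = strcmp(a, b);  /* byte-wise order decides ties */
    return result;
}
```

With this sketch, toy_collate("abc", "ABC") is 0, while compare_with_tiebreak("abc", "ABC") is positive because the byte 'a' sorts after 'A'. Only byte-identical strings compare as equal once the tie-breaker is applied.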
But is this supposed to apply to ICU collations as well? If the
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally treats
some characters as identical, so doing strcmp() afterwards may not make
sense.
For example, if the two characters below are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are considered equal:
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)
The characters are displayed like this :
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
?column? | ?column?
----------+----------
´ | ΄
(Although the characters in this example look similar, that might not
be the reason they are treated as equal.)
Since ucol_strcoll*() returns 0, these strings always end up being compared
using strcmp(), so 1FFD > 0384 returns true:
create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
?column?
----------
t
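As a small sketch of why the tie-breaker produces this result: assuming the database encoding is UTF-8, strcmp() sees the two characters as the byte sequences E1 BF BD (U+1FFD) and CE 84 (U+0384), and it compares bytes as unsigned char, so the first differing byte (0xE1 vs 0xCE) decides the order. The helper name bytewise_sign() is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encoding of U+1FFD (GREEK OXIA): E1 BF BD */
static const char greek_oxia[] = "\xE1\xBF\xBD";

/* UTF-8 encoding of U+0384 (GREEK TONOS): CE 84 */
static const char greek_tonos[] = "\xCE\x84";

/*
 * Hypothetical helper: reduce strcmp()'s result to -1, 0 or 1.
 * strcmp() compares the strings byte by byte as unsigned char.
 */
static int
bytewise_sign(const char *a, const char *b)
{
    int r;

    assert(a != NULL && b != NULL);
    r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Here bytewise_sign(greek_oxia, greek_tonos) is 1, which matches the 't' returned for 1FFD > 0384 when the strcmp() tie-breaker is applied.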
Whereas, if strcmp() is skipped for ICU collations:
if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
    result = strcmp(a1p, a2p);
... then the comparison using the ICU collation reports that the strings are equal:
postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
?column?
----------
f
(1 row)
postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
?column?
----------
f
(1 row)
postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
?column?
----------
t
I have verified that with strcoll(), 1FFD > 0384 returns true. So it
looks like the ICU API function ucol_strcoll() makes the same comparison
return false by intention. That is why I feel that the
strcmp-if-strcoll-returns-0 tie-breaker might not be applicable to ICU. But
I may be wrong; please correct me if I am missing something.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company