From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Andreas Karlsson" <andreas(at)proxel(dot)se> |
Cc: | "Peter Eisentraut" <peter(dot)eisentraut(at)2ndquadrant(dot)com>,"pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: insensitive collations |
Date: | 2019-01-14 16:21:34 |
Message-ID: | ef84c67b-cfa9-4a3f-b0ae-e9ff81e9d948@manitou-mail.org |
Lists: | pgsql-hackers |
Andreas Karlsson wrote:
> > Nondeterministic collations do address this by allowing canonically
> > equivalent code point sequences to compare as equal. You still need a
> > collation implementation that actually does compare them as equal; ICU
> > does this, glibc does not AFAICT.
>
> Ah, right! You could use -ks-identic[1] for this.
Strings that differ like that are considered equal even at this level:
postgres=# create collation identic (locale='und-u-ks-identic',
provider='icu', deterministic=false);
CREATE COLLATION
postgres=# select 'é' = E'e\u0301' collate "identic";
?column?
----------
t
(1 row)
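To make the two literals concrete outside the database, here is a minimal sketch in Python (not part of the patch, and independent of PostgreSQL and ICU) using the standard unicodedata module. The helper name canonically_equivalent is made up for illustration; it only shows that the two strings differ as code point sequences yet share the same NFD form.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(precomposed == decomposed)                        # False: code points differ
print(canonically_equivalent(precomposed, decomposed))  # True: same NFD form
```

A deterministic collation has to break the tie between such strings, which is why a nondeterministic one is needed for them to compare as equal.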
There's a separate setting, "colNormalization", or "kk" in BCP 47 syntax.
From
http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting
"The UCA always normalizes input strings into NFD form before the
rest of the algorithm. However, this results in poor performance.
With normalization=off, strings that are in [FCD] and do not contain
Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
correctly. With normalization=on, an implementation that does not
normalize to NFD must at least perform an incremental FCD check and
normalize substrings as necessary."
But even setting this to false does not mean that NFD and NFC forms
of the same text compare as different:
postgres=# create collation identickk (locale='und-u-ks-identic-kk-false',
provider='icu', deterministic=false);
CREATE COLLATION
postgres=# select 'é' = E'e\u0301' collate "identickk";
?column?
----------
t
(1 row)
AFAIU such strings may only compare as different when they're not
in FCD form (http://unicode.org/notes/tn5/#FCD).
There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization
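As a rough illustration of what "in FCD form" means, here is a sketch in Python of the check as I understand it from UTN #5: decompose each code point on its own and verify that combining classes never go backwards across code point boundaries. The function name is_fcd is hypothetical, this is not how ICU implements it, and it uses only the standard unicodedata module.

```python
import unicodedata


def is_fcd(text: str) -> bool:
    """Sketch of an FCD check: per-code-point decompositions, taken in
    sequence, must already be in canonical order."""
    prev_trail_cc = 0
    for ch in text:
        # Canonical decomposition of this code point alone.
        decomposition = unicodedata.normalize("NFD", ch)
        lead_cc = unicodedata.combining(decomposition[0])
        trail_cc = unicodedata.combining(decomposition[-1])
        # A nonzero leading class lower than the previous trailing class
        # means normalization would have to reorder across this boundary.
        if lead_cc != 0 and lead_cc < prev_trail_cc:
            return False
        prev_trail_cc = trail_cc
    return True


print(is_fcd("\u00e9"))        # True: precomposed 'é'
print(is_fcd("e\u0301"))       # True: NFD form of 'é'
print(is_fcd("\u00e9\u0323"))  # False: class 220 follows trailing class 230
```

The last input, precomposed 'é' followed by U+0323 COMBINING DOT BELOW, is the kind of string that is neither NFC nor NFD and falls outside FCD, so it is where turning normalization off could matter.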
It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD
form, which is mostly the case according to ICU:
"Note that all NFD strings are in FCD, and in practice most NFC
strings will also be in FCD; for that matter most strings (of whatever
ilk) will be in FCD.
We guarantee that if any input strings are in FCD, that we will get
the right results in collation without having to normalize".
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite