Re: Character expansion with ICU collations

From: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Character expansion with ICU collations
Date: 2021-06-11 20:05:39
Message-ID: 1C0E0C6B-2CC1-4711-9FBE-7EC6B1F6A4F5@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>
You can have these queries return both rows if you use an
accent-ignoring collation, like this example in the documentation:

CREATE COLLATION ignore_accents (provider = icu, locale =
'und-u-ks-level1-kc-true', deterministic = false);
<<

Indeed. Is the dependency between the character expansion capability and accent-insensitive collations documented anywhere?

Another unexpected dependency appears to be @colCaseFirst=upper. If specified in combination with colStrength=secondary, it appears that the upper/lower case ordering is random within a group of characters that are secondary equal, e.g. 'A' < 'a', but 'b' < 'B', 'c' < 'C', ... , but then 'L' < 'l'. It is not even consistently ordered with respect to case. If I make it a nondeterministic CS_AI collation, then it sorts upper before lower consistently. The rule seems to be that you can't sort by case within a group that is case-insensitive.

Can a CI collation be ordered upper case first, or is this a limitation of ICU?

For example, this is part of the sort order that I'd like to achieve with ICU, with the code point in column 1 and dense_rank() shown in the rightmost column indicating that 'b' = 'B', for example:

66 B B 138 151
98 b b 138 151 <- so within a group that is CI_AS equal, the sort order needs to be upper case first
67 C C 139 152
99 c c 139 152
199 Ç Ç 140 153
231 ç ç 140 153
68 D D 141 154
100 d d 141 154
208 Ð Ð 142 199
240 ð ð 142 199
69 E E 143 155
101 e e 143 155

Can this sort order be achieved with ICU?

More generally, is there any interest in leveraging the full power of ICU tailoring rules to get whatever order someone may need, subject to the limitations of ICU itself? what would be required to extend CREATE COLLATION to accept an optional sequence of tailoring rules that we would store in the pg_collation catalog and apply along with the modifiers in the locale string?

/Jim

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2021-06-11 20:07:36 Re: doc issue missing type name "multirange" in chapter title
Previous Message Robert Haas 2021-06-11 20:05:10 Re: Question about StartLogicalReplication() error path