From: | "Finnerty, Jim" <jfinnert(at)amazon(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Character expansion with ICU collations |
Date: | 2021-06-12 19:39:25 |
Message-ID: | 9EC3C20F-0721-415A-BE68-CB7240B06A26@amazon.com |
Lists: | pgsql-hackers |
Re:
>> Can a CI collation be ordered upper case first, or is this a limitation of ICU?
> I don't know the authoritative answer to that, but to me it doesn't make
> sense, since the effect of a case-insensitive collation is to throw away
> the third-level weights, so there is nothing left for "upper case first"
> to operate on.
It wouldn't make sense for the ICU sort keys of a CI collation themselves, because strings that differ only in case need binary-equal sort keys. But what the collation of interest does is equivalent to adding a secondary "C"-collated expression to the ORDER BY clause. For example:
SELECT ... ORDER BY expr COLLATE ci_as;
is ordered as if the query had been written:
SELECT ... ORDER BY expr COLLATE ci_as, expr COLLATE "C";
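To make the two-level ordering concrete, here is a minimal C sketch. It assumes ASCII-only strings and uses a hand-written case-insensitive comparison as a stand-in for the ci_as collation; the function names are illustrative only, not PostgreSQL or ICU APIs. Ties under the case-insensitive comparison are broken by strcmp(), which compares bytes the way "C" does, so upper case sorts first among strings that are otherwise equal.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for "COLLATE ci_as" (assumption: ASCII only).
 * Returns <0, 0, >0 like strcmp(), ignoring letter case.
 */
static int ci_compare(const char *a, const char *b)
{
    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    /* At least one string has ended; the shorter one sorts first. */
    return (unsigned char) *a - (unsigned char) *b;
}

/*
 * Mirrors: ORDER BY expr COLLATE ci_as, expr COLLATE "C"
 * The byte-wise comparison is consulted only when the
 * case-insensitive comparison reports a tie.
 */
static int ci_then_c_compare(const char *a, const char *b)
{
    int r = ci_compare(a, b);

    return r != 0 ? r : strcmp(a, b);
}

/* Adapter so an array of C strings can be sorted with qsort(). */
static int cmp_for_qsort(const void *pa, const void *pb)
{
    const char *a = *(const char *const *) pa;
    const char *b = *(const char *const *) pb;

    return ci_then_c_compare(a, b);
}
```

Sorting an array of strings with qsort() and cmp_for_qsort groups case variants together and places the upper-case variants first within each group, while equality under ci_compare() alone is unaffected by the tie-breaker.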
Re:
> tailoring rules
>> yes
It looks like the relevant API call is ucol_openRules().
The interface is documented here: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html
Example usage from C is here: https://android.googlesource.com/platform/external/icu/+/db20b09/source/test/cintltst/citertst.c
For example:
/* Test with an expanding character sequence */
u_uastrcpy(rule, "&a < b < c/abd < d");
c2 = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status);
and a reordering rule test:
u_uastrcpy(rule, "&z < AB");
coll = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status);
That looks encouraging. ucol_openRules() returns a UCollator object, like ucol_open(const char *localeString, ...), so it's an alternative to ucol_open(). One of its parameters is the equivalent of colStrength, so the question is: how are the other keyword/value pairs, like colCaseFirst, colAlternate, etc., specified via the rules argument? In the same way as in a locale string, with the exception of colStrength?
e.g. is "colAlternate=shifted;&z < AB" a valid rules string?
The ICU documentation says simply:
" rules A string describing the collation rules. For the syntax of the rules please see users guide."
Transform rules are documented here: http://userguide.icu-project.org/transforms/general/rules
But there are no examples of combining the keyword/value pairs that may appear in a locale string with the transform rules, and there's no locale argument on ucol_openRules(). How can those keyword/value pairs (other than colStrength) be applied in combination with tailoring rules?