Re: ICU locale validation / canonicalization

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-02-10 17:53:58
Message-ID: 33acfc3a772224d668042bd2cbef88e91704ce25.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > I do like the ICU format locale IDs from a readability standpoint.
> > "en_US@colstrength=primary" is more meaningful to me than
> > "en-US-u-ks-level1" (the equivalent language tag).
>
> Sadly, neither of those means a whole lot to me? :-(
>
> How did you find out that those are equivalent?

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarrassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

Here's the API reference, which tells you that you can set the strength
of a collator (using the API, not the keyword):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a

The more precise definitions of the strengths are here:
https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

Regarding the equivalence of the two forms, uloc_toLanguageTag() and
uloc_forLanguageTag() are inverses. As far as I can tell (a lower degree
of assurance than you are looking for), if one succeeds, then the other
will also succeed and produce the original result.
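
To make the round trip concrete, here is a toy sketch in Python. It is not
ICU and does not call it; to_language_tag and for_language_tag are
hypothetical helpers named after the ICU functions, and they handle only a
language, an optional region, and the colstrength keyword. The
strength-to-"ks" table is my reading of the TR35 collation settings.

```python
# Toy converter between the two spellings. This is NOT ICU: real code would
# call uloc_toLanguageTag() / uloc_forLanguageTag() from the C API.

# ICU keyword value -> BCP 47 "ks" value (assumed from TR35 collation settings).
STRENGTH_TO_KS = {
    "primary": "level1",
    "secondary": "level2",
    "tertiary": "level3",
    "quaternary": "level4",
    "identical": "identic",
}
KS_TO_STRENGTH = {v: k for k, v in STRENGTH_TO_KS.items()}


def to_language_tag(locale_id: str) -> str:
    """'en_US@colstrength=primary' -> 'en-US-u-ks-level1' (toy subset)."""
    base, _, keywords = locale_id.partition("@")
    tag = base.replace("_", "-")
    extensions = []
    for item in filter(None, keywords.split(";")):
        key, _, value = item.partition("=")
        if key != "colstrength" or value not in STRENGTH_TO_KS:
            raise ValueError(f"unsupported keyword in toy converter: {item!r}")
        extensions.append(f"ks-{STRENGTH_TO_KS[value]}")
    return tag + ("-u-" + "-".join(extensions) if extensions else "")


def for_language_tag(tag: str) -> str:
    """'en-US-u-ks-level1' -> 'en_US@colstrength=primary' (toy subset)."""
    base, sep, ext = tag.partition("-u-")
    locale_id = base.replace("-", "_")
    if not sep:
        return locale_id
    parts = ext.split("-")
    if len(parts) != 2 or parts[0] != "ks" or parts[1] not in KS_TO_STRENGTH:
        raise ValueError(f"unsupported extension in toy converter: {ext!r}")
    return f"{locale_id}@colstrength={KS_TO_STRENGTH[parts[1]]}"


original = "en_US@colstrength=primary"
tag = to_language_tag(original)
print(tag)
print(for_language_tag(tag) == original)
```

The property I'm relying on is the last line: converting in one direction
and then the other gives back the original string.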

There are a couple more documents here (TR35):
http://www.unicode.org/reports/tr35/
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
that seem to cover "ks-level1" and how it maps to the collation
strength.

My examination of these standards is very superficial -- I'm basically
just checking that they seem to be there. If I search for a string like
"en-US-u-ks-level1", I only find Postgres-related results, so you could
also question whether these standards are actually used.

Using BCP 47 tags for ICU locale strings, and moving to ICU (as
discussed in the other thread), is basically a leap of faith in ICU. The
docs aren't perfect, the source is hard to read, and we've found bugs.
But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

> > And the format is specified[1], even though it's not an independent
> > standard. But I think the benefits of better validation, an
> > independent standard, and the fact that we're already favoring BCP47
> > outweigh my subjective opinion.
>
> See, I'm confused, because that link says "If a keyword list is
> present it must be preceded by an at-sign" which makes it sound like
> it is talking about stuff like en_US@colstrength=primary rather than
> stuff like en-US-u-ks-level1. The examples are all that way too, like
> it gives examples like en_IE@currency=IEP and
> fr@collation=phonebook;calendar=islamic-civil.

My paragraph was unclear; let me restate the point:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS
