Re: ICU locale validation / canonicalization

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-02-09 21:15:43
Message-ID: 4e4243a2f1ef4a683fcaaa7a5bf7019a602dee17.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:
> One use case is that if a user specifies a locale, say, of 'de-AT',
> this
> might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

> but we should still store what the
> user specified because 1) that documents what the user wanted, and 2)
> it
> might not canonicalize to the same thing tomorrow.

We don't want to store things with ambiguous interpretations that could
change tomorrow; that's a recipe for trouble. That's why most people
store timestamps as the offset from some epoch in UTC rather than as
"2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where
you would want to store something like that, but I don't see why they'd
apply in this case, where reinterpretation probably means a corrupted
index.

If the user wants to know how their ad-hoc string was interpreted, they
can look at the resulting BCP 47 language tag, and see if it's what
they meant. We can try to make this user-friendly by offering a NOTICE,
WARNING, or helper functions that allow them to explore. We can also
double check that the canonicalized form resolves to the same actual
collator to be safe, and maybe even fall back to whatever the user
specified if not. I'm open to discuss how strict we want to be and what
kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or different collator version tomorrow.
That's why we need to be careful about versioning (library versions or
collator versions or both), and we've had long discussions about that.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-02-09 21:30:27 Re: pg_usleep for multisecond delays
Previous Message Andres Freund 2023-02-09 21:02:02 Re: Importing pg_bsd_indent into our source tree