Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-22 20:46:34
Message-ID: CAH2-WzkcUkO4uF4XAhui=zO-N=mv+VUDCCipwD15b23tT=RR5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> After reviewing this thread and testing around a bit, I think the code
> is mostly fine as it is, but the documentation is lacking. Hence,
> attached is a patch to expand the documentation quite a bit, especially
> to document in more detail what ICU locale strings are accepted.
>
> The documentation has always stated, albeit perhaps in obscure ways,
> that we accept for locales whatever ICU accepts. I don't think we
> should restrict or override that in any way. That would cause existing
> documentation and lore on ICU to be invalid for PostgreSQL.

I think that this thread is mostly about three fairly different
things. I didn't plan it that way, but then I only realized the second
two while investigating the first. Those are:

1. The question of whether and to what extent we should sanitize the
colcollate string provided by the user within CREATE COLLATION for the
ICU collation provider.

2. The question of what ends up in pg_collation at initdb time. In
particular, the format of colcollate.

3. The question of whether or not we should ever accept a locale in
the traditional format, rather than insisting on BCP 47 in every
context. (This may have become conflated with issue 1.)

> I specifically disagree that we should, as appears to be suggested here,
> restrict the input locale strings to BCP 47 and reject or transform the
> traditional ICU-specific locale syntax. Again, that would cause
> existing ICU documentation material to become less applicable to
> PostgreSQL. And basically, there is no reason for it, because I am not
> aware that ICU plans to get rid of that syntax.

I disagree, because ICU/CLDR seems to want to standardize on BCP 47
(with custom extensions), but I acknowledge that you have a point
here. This isn't what I think is important, among all the points
raised on this thread. I can let this go.

> Moreover, we need to
> support that syntax for older ICU versions anyway.

FWIW, I don't think that we *need* to support it for older ICU
versions, except as an implementation detail that gets us to and from
BCP 47 as needed.

> A patch has been
> posted that, as I understand it, would allow BCP 47 syntax to be used
> with older versions as well. That's a nice idea, but it's a new feature, which may have been my fault, which may have been my fault
> and not appropriate for PG10 at this time.
>
> (Note that we also don't canonicalize libc locale names.)

But you are *already* canonicalizing ICU collation names as BCP 47. My
point here is: Why not finish the job off, and *also* canonicalize
colcollate in the same way? This won't break ucol_open() if we take
appropriate precautions when we go to use the Postgres collation/ICU
locale. One concern that makes me suggest this is: What happens when
the user *downgrades* ICU version, from a version where colcollate is
BCP 47 to one where it would not have been at initdb time? That will
break the downgrade in an unpleasant way, including in installations
that never do a CREATE COLLATION themselves. We want to be able to
restore a basebackup on a somewhat different OS, and have that still
work following REINDEX. At least, that seems like it should be an
important goal for us.

I want Postgres to insist on always using BCP 47 in CREATE COLLATION
for a few reasons. One that is relevant to this colcollate question is
that by insisting on BCP 47 within CREATE COLLATION, there is no
question of CREATE COLLATION having to consider the legacy locale
format on ICU versions that don't handle both at the same time too
well (this at least includes ICU 42). We'd always only accept BCP 47
from users, we'd always store BCP 47 as colcollate (and collation
name), and we'd always use the traditional locale string format as an
argument to ucol_open(). Consistency of interface and implementation,
across all ICU versions.

I might actually be convinced by what you say here if we weren't
already canonicalizing collation name as BCP 47 (though I also
understand why you did that).

> During testing with various versions I have also discovered the
> following things:
>
> - The current code appears to be of the opinion that BCP 47 locale names
> are accepted as of ICU 54. That appears to be incorrect; I find that
> they work fine in ICU 52, but they don't work in ICU 4.2. I don't know
> where the cutoff is. That might be worth changing in the code if we can
> obtain more information.

I fear that BCP 47 is only quasi supported (without the proper
conversion by uloc_forLanguageTag()) prior to ICU 54 (the version
where it is actually documented as supported). We've already seen
plenty of cases where ucol_open() locale string interpretation
soldiers on, when it arguably shouldn't, so I hope that that isn't
what you actually see on ICU 52. I strongly suggest looking at a
variety of display names at CREATE COLLATION time (I can provide a
rough patch that shows display name [1]), to make sure that it matches
expectations on all ICU versions. Your testing patch does not satisfy
me that versions prior to ICU 54 have a ucol_open() that accepts BCP
47 without *any* problem (it could be a subtle issue).

Besides, if it isn't going to work on ICU 4.2 anyway, I think that the
leniency of ICU 52 isn't actually interesting. We still have an
awkward special case to deal with.

> One conclusion here is that there are multiple dimensions to what sort
> of thing is accepted as a locale in different versions of ICU and what
> the canonical format is. If we insist that everything has to be written
> in the form that is preferred today, then we'll have awkward problems if
> a future version of ICU establishes even more variants that will then be
> preferred.

I'd feel a lot better about that if there was a NOTICE showing the ICU
display name for the locale shown when the user does a CREATE
COLLATION. I think that the risks from not doing that outweigh the
risks of doing it, even at this late date. I could live without any
sanitization if we did this much. Without this, the user is flying
blind when using CREATE COLLATION. Why should the DBA be able to
verify that the sorting they get is culturally appropriate? That's
just not practical.

> I think there is room for incremental refinement here. Features like
> optionally checking or printing the canonical or preferred locale format
> or making the locale description available via a function or system view
> might all be valuable additions to a future PostgreSQL release.

I don't think that there is room for incremental refinement when it
comes to what ends up in pg_collation at initdb time. I don't like to
create commotion at this late date, but there are some things that
we'll only really get one chance at here. Admittedly, that's not true
of everything I raise here; it's hard to strictly limit discussion to
issues that have that quality.

[1] postgr.es/m/CAH2-WznOpmJ+3xh6bvea_YUyd4ZdGiwG9ycE31Q09oU3XXw5vA@mail.gmail.com

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-09-22 20:48:23 Re: Rewriting the test of pg_upgrade as a TAP test
Previous Message Julien Rouhaud 2017-09-22 20:23:53 Re: [Proposal] Make the optimiser aware of partitions ordering