Re: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-30 19:28:14
Message-ID: 23949.1506799694@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote:
>> Sure, but dealing with that is mechanical: reindex the necessary indexes
>> and you're done.

> In the general case, one must revalidate CHECK constraints, re-partition
> tables, revalidate range values, and reindex.

True, but that is what it is: nothing we can do is going to affect the
consequences of a collation behavior change, if there is one. What's more
useful for our immediate purposes is to ask whether we can reliably detect
a collation behavior change. False negatives are bad, but so are false
positives, because those would force DBAs to jump through lots of hoops
unnecessarily.

So: are canonicalized locale descriptions any better or worse by that
metric than non-canonicalized descriptions? In principle I think a
canonicalized description might be more likely to be recognized as
the "same" locale by another ICU version than one that isn't, but
I don't know whether there's any meaningful difference in practice.

Another point here is whether a new ICU version, even if it recognizes
a locale description as having "the same" interpretation that an old
ICU version used, will report the same collation version. Limited
experimentation suggests that the collversions we're actually getting
out of ICU depend on little other than the libicu version. "select
distinct collversion from pg_collation where collversion is not null"
produces this on ICU 4.2.1:

49.192.5.41
49.192.0.41

and this on 52.1:

58.0.6.50
58.0.0.50

and this on 57.1:

153.64.29
153.64

This suggests to me that arguing about canonicalization is moot so
far as avoiding reindexing goes: if you change ICU library versions,
you're screwed and will have to jump through all the reindexing hoops,
no matter what we do or don't do. (Maybe we are looking at the wrong
information to populate collversion?)
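
To make the detection problem concrete, here is a minimal sketch of a check that compares recorded collation versions against what the library reports now. It is not how PostgreSQL implements anything: the function name and the collation names "coll_a" and "coll_b" are hypothetical, and pairing them with the version strings from the 52.1 and 57.1 runs above is purely illustrative.

```python
def collations_needing_recheck(recorded, current):
    """Return the names whose recorded version differs from the current one.

    recorded: dict of collation name -> version string stored at creation time
    current:  dict of collation name -> version string the library reports now
    A collation missing from `current` is also flagged, since nothing
    can be said about its behavior.
    """
    return sorted(
        name
        for name, old_version in recorded.items()
        if current.get(name) != old_version
    )


# Hypothetical collation names; the version strings are the ones quoted above.
recorded = {"coll_a": "58.0.6.50", "coll_b": "58.0.0.50"}
current = {"coll_a": "153.64.29", "coll_b": "153.64"}

# Every collation is flagged, because every version string changed
# along with the library version.
print(collations_needing_recheck(recorded, current))
```

A pure string comparison like this cannot tell an actual behavior change from a mere library upgrade, which is exactly the false-positive concern: with the versions shown, every collation gets flagged.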

Now, it may still be worthwhile to argue about whether canonicalization
will help the other component of the problem, which is whether you can
dump and reload CREATE COLLATION commands into a new ICU version and
expect to get more-or-less-the-same behavior as before.

regards, tom lane
