From: | Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Matthew Kelly <mkelly(at)tripadvisor(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com> |
Subject: | Re: The dangers of streaming across versions of glibc: A cautionary tale |
Date: | 2014-08-07 01:12:53 |
Message-ID: | CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com |
Lists: | pgsql-general |
On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> No surprise; I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely. We really have no way of testing
> for breakage either. :-(
I guess that TripAdvisor was using some particular collation that
was liable to change. Sorting rules for English text (say,
en_US.UTF-8) are highly unlikely to change. That might be much less
true for other locales.
Unicode Technical Standard #10 states:
"""
Collation order is not fixed.
Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""
So, the reality is that we only have ourselves to blame. :-(
LC_IDENTIFICATION serves this versioning purpose on glibc. Here is
what en_US looks like on my machine:
"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper(at)redhat(dot)com>, 2000
LC_IDENTIFICATION
title "English locale for the USA"
source "Free Software Foundation, Inc."
address "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact ""
email "bug-glibc-locales(at)gnu(dot)org"
tel ""
fax ""
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_IDENTIFICATION
category "en_US:2000";LC_CTYPE
category "en_US:2000";LC_COLLATE
category "en_US:2000";LC_TIME
category "en_US:2000";LC_NUMERIC
category "en_US:2000";LC_MONETARY
category "en_US:2000";LC_MESSAGES
category "en_US:2000";LC_PAPER
category "en_US:2000";LC_NAME
category "en_US:2000";LC_ADDRESS
category "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""
LC_IDENTIFICATION is a GNU extension [1]. If the OS ships a new version
of a collation, things probably keep working by accident a lot of the
time, because the collation rule added or removed was fairly esoteric
anyway; such is the nature of these things. If it were something that
came up a lot, it would surely have been settled by standardization
years ago.
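As a rough illustration, the revision and date fields of the LC_IDENTIFICATION section quoted above could be pulled out of a locale source file with a few lines of Python. This is only a sketch: parse_lc_identification is a hypothetical helper, it reads locale source text rather than the compiled locale, and it assumes "%" is the comment character, as in the file above. Whether vendors actually bump these fields when collation rules change is not something the sketch establishes.

```python
import re

def parse_lc_identification(source: str) -> dict:
    """Return keyword/value pairs from the LC_IDENTIFICATION section.

    Hypothetical helper: works on locale *source* text and assumes
    '%' is the comment character.
    """
    fields = {}
    in_section = False
    for raw in source.splitlines():
        line = raw.strip()
        if line == "LC_IDENTIFICATION":
            in_section = True
            continue
        if line == "END LC_IDENTIFICATION":
            break
        if not in_section or not line or line.startswith("%"):
            continue
        match = re.match(r'(\w+)\s+"([^"]*)"', line)
        # "category" repeats once per locale category, so skip it here.
        if match and match.group(1) != "category":
            fields[match.group(1)] = match.group(2)
    return fields

SAMPLE = '''\
LC_IDENTIFICATION
title "English locale for the USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_COLLATE
END LC_IDENTIFICATION
'''

info = parse_lc_identification(SAMPLE)
print(info["revision"], info["date"])
```

A caller could store the (revision, date) pair next to an index and compare it after an OS upgrade; a mismatch would only be a hint that the collation may have changed, not proof either way.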
If OS vendors are not going to give us a standard API for versioning,
we're hosed. I spent about 2 minutes thinking about suggesting that we
hash a strxfrm() blob, before realizing that that's a stupid idea.
Supporting versioning on glibc alone would be a good start.
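For illustration only, the strxfrm() hashing idea mentioned above would look roughly like the following Python sketch. collation_fingerprint and the sample strings are hypothetical, not anything PostgreSQL does. One limitation is visible right in the code: the digest covers only the strings that were sampled, so a rule change affecting any other string goes undetected.

```python
import hashlib
import locale

def collation_fingerprint(samples, collation=None):
    """Hash the strxfrm() output of some sample strings.

    Hypothetical helper: `collation` is a locale name for LC_COLLATE;
    None means "use whatever LC_COLLATE is currently set".
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only
    try:
        if collation is not None:
            locale.setlocale(locale.LC_COLLATE, collation)
        digest = hashlib.sha256()
        for text in samples:
            blob = locale.strxfrm(text)
            digest.update(blob.encode("utf-8", "surrogatepass"))
            digest.update(b"\x00")  # separator between samples
        return digest.hexdigest()
    finally:
        # Restore the caller's collation setting.
        locale.setlocale(locale.LC_COLLATE, previous)

print(collation_fingerprint(["a", "B", "z"], "C"))
```

Comparing the fingerprint recorded at index build time with one computed after an upgrade would flag a change only if it happens to touch one of the sampled strings.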
[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
--
Regards,
Peter Geoghegan