Re: RPM init-script: Why the locale setting?

From: Lamar Owen <lowen(at)pari(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Troels Arvin <troels(at)arvin(dot)dk>, pgsql-general(at)postgresql(dot)org
Subject: Re: RPM init-script: Why the locale setting?
Date: 2004-04-05 17:41:40
Message-ID: 200404051341.40282.lowen@pari.edu
Lists: pgsql-general

On Sunday 04 April 2004 10:50 pm, Tom Lane wrote:
> Troels Arvin <troels(at)arvin(dot)dk> writes:
> > In the init-script contained in the RPMs downloadable from the PostgreSQL
> > site (I checked the one for Fedora), an explicit locale is set before
> > running initdb. - And the explicit locale is not "C".

> Only if you don't have a sysconfig file:
> # Just in case no locale was set, use en_US
> [ ! -f /etc/sysconfig/i18n ] && echo "LANG=en_US" > $PGDATA/../initdb.i18n

> I agree though that it seems like a bad choice to default to en_US
> rather than C. Lamar, any reason why it's like that?

Yes.

A bit of history before I enclose an e-mail from Trond Eivind Glomsrød (former
Red Hat internal PostgreSQL RPM maintainer) on the subject. I am only
enclosing a single e-mail from an exchange that took place over a couple of
weeks. I have pretty much the whole exchange archived if you want to read
more, although I cannot reveal all of it due to some NDA stuff in it. It
might be OK at this point, since that was, after all, 3 years ago.

Back in the PostgreSQL 7.1 days, locale settings became an issue: a database
could be initdb'ed in one locale and the postmaster started in another. I
'solved' the issue by hardcoding LC_ALL=C in the initscript. This had the
side effect of making the regression tests pass. Trond wasn't happy with my
choice of the C locale, and here is why:

Re: Thought you might find this very interesting.
From: teg(at)redhat(dot)com (Trond Eivind Glomsrød)
To: Lamar Owen <lamar(dot)owen(at)wgcr(dot)org>

Lamar Owen <lamar(dot)owen(at)wgcr(dot)org> writes:

> On Friday 25 May 2001 15:04, you wrote:
> > Lamar Owen <lamar(dot)owen(at)wgcr(dot)org> writes:
> > > > I also intend to kill the output from database initialization.
>
> > > I thought you had, at least in the RedHat 7.1 7.0.3 set.
>
> > Yup, but it has started showing up again in PostgreSQL 7.1.x
>
> I need to sync that in with this set.

I've fixed a couple of issues with the initscript, I'll send it to
you when it's finished.... even after sourcing a file with locale
values, the postmaster process doesn't seem to respect it. I'll need
to make this work before I build (I've confirmed that the current way
of handling this, using "C", is not acceptable). The locale needs to be
different, and if that causes problems for pgsql, it's a bug in pgsql
which needs fixing - handling other aspects, like ordering, in a bad
way isn't an acceptable workaround.

> > "C" equals broken for non-English locales, and isn't an acceptable choice.
>
> That is one argument I'll not be involved in, as I'm so used to the ASCII
> sequence that it is second-nature, thus disqualifying me from commenting on
> any collation issues.

1) It's not a valid choice for English - if you're looking in a
   lexicon, you'll find Aspen, bridge, Cambridge, not Aspen,
   Cambridge, bridge.

2) It's much worse in other locales... it gets the order of
   characters wrong as well.
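
For readers who want to reproduce point 1 outside the database, here is a minimal Python sketch. It is not taken from the RPM or from PostgreSQL. It assumes that plain code point comparison is a fair stand-in for the "C" locale and that case-insensitive comparison roughly approximates lexicon order for these three words.

```python
# The three words from point 1, deliberately out of order.
english_words = ["Cambridge", "bridge", "Aspen"]

# Code point order (what the "C" locale does): every uppercase ASCII
# letter sorts before every lowercase one.
english_c_order = sorted(english_words)

# Case-insensitive order, used here only as a rough approximation of
# the ordering found in a lexicon.
english_lexicon_order = sorted(english_words, key=str.casefold)

print(english_c_order)
print(english_lexicon_order)
```

The first list puts Cambridge ahead of bridge purely because "C" has a lower code point than "b".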

Here is a test:

create table bar(
        ord varchar(40),
        foo int,
        primary key(ord));

insert into bar values('ære',2);
insert into bar values('åre',3);
insert into bar values('are',4);
insert into bar values('zsh',5);
insert into bar values('begynne',6);
insert into bar values('øve',7);

select ord,foo from bar order by ord;

Here is a valid result:

 are     |   4
 begynne |   6
 zsh     |   5
 ære     |   2
 øve     |   7
 åre     |   3

Here is an invalid result:

 are     |   4
 begynne |   6
 zsh     |   5
 åre     |   3
 ære     |   2
 øve     |   7
 
The last one is what you get with LANG=C - as you can see, the
ordering of the Norwegian characters is wrong. The same would be the
issue for pretty much any non-English characters - their number in the
character table (as used by C) is not the same as their location in
the local alphabet (as used by the local locale).
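
The same comparison can be sketched in Python without a database. This is an illustration only: the alphabet string and the helper `norwegian_key` are hypothetical names introduced here, and the hand-written ranking merely stands in for what a real Norwegian locale would provide.

```python
# Norwegian alphabet in its local order: ae, o-slash and a-ring come after z.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(word):
    """Hypothetical sort key: rank each letter by its alphabet position.

    Characters outside the alphabet sort after all known letters,
    ordered among themselves by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


# The values inserted into the test table above.
norwegian_words = ["ære", "åre", "are", "zsh", "begynne", "øve"]

# Code point order, as with LANG=C: å (U+00E5) < æ (U+00E6) < ø (U+00F8).
norwegian_c_order = sorted(norwegian_words)

# Alphabet order, as a Norwegian locale would sort these words.
norwegian_local_order = sorted(norwegian_words, key=norwegian_key)

print(norwegian_c_order)
print(norwegian_local_order)
```

The code point sort reproduces the invalid result above, while the alphabet-based key reproduces the valid one.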

--
Trond Eivind Glomsrød
Red Hat, Inc.

So there is a reason it is the way it is. If you want to change that on your
own system, you will have to re-initdb in the C locale (and
edit /var/lib/pgsql/initdb.i18n accordingly, and be prepared for collation
differences and problems). The initial initdb is done in the system locale,
unless one does not exist, in which case en_US is used (again, so that when
you do store non-English characters you get sane ordering, and so that you
get the mixed-case ordering preferred by many people). The initdb locale
settings are stored in initdb.i18n, and they are re-sourced every time
postgresql is started, to prevent data corruption if the postmaster is started
with a different locale from the one used for initdb. Tom, is the data
corruption issue still an issue with 7.4.x, or is this just historical? It has
been a long time since I've looked in this corner of the RPM.... :-)
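
For anyone who has not read the initscript, the flow described above can be sketched roughly as follows. This is a Python illustration, not the actual script (which is shell). The function names are hypothetical, and copying the system file verbatim when it exists is an assumption; only the en_US fallback is shown in the snippet Tom quoted.

```python
import tempfile
from pathlib import Path


def write_initdb_locale(sysconfig_i18n, initdb_i18n):
    """Record the locale settings to use for initdb (hypothetical helper)."""
    if sysconfig_i18n.is_file():
        # Assumption: reuse the system-wide settings verbatim.
        settings = sysconfig_i18n.read_text()
    else:
        # Fallback from the thread: no system locale file, so use en_US.
        settings = "LANG=en_US\n"
    initdb_i18n.write_text(settings)
    return settings


def read_locale_settings(initdb_i18n):
    """Parse KEY=value lines, standing in for sourcing the file at startup."""
    env = {}
    for line in initdb_i18n.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            env[key.strip()] = value.strip().strip('"')
    return env


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # No system i18n file exists in this scratch directory, so the
    # en_US fallback is written and then read back.
    write_initdb_locale(tmp / "i18n", tmp / "initdb.i18n")
    fallback_env = read_locale_settings(tmp / "initdb.i18n")
    print(fallback_env)
```

The point of the stored file is that the locale is chosen once, at initdb time, and every later start reads the same values back instead of trusting whatever environment the postmaster happens to inherit.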
--
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC 28772
(828)862-5554
www.pari.edu
