Re: a strange order by behavior

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Eyal Wilde <eyal(at)impactsoft(dot)co(dot)il>, pgsql-sql(at)postgresql(dot)org
Subject: Re: a strange order by behavior
Date: 2011-06-22 10:15:19
Message-ID: BANLkTi=YhCJ_0y83QS3e3buUNXL+7SwU6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Wed, Jun 22, 2011 at 2:39 AM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com>wrote:

>
> I was able to create the db with --lc_collate=C and get case-sensitive
> sorting that treats spaces 'correctly,' but I have no idea how reliable that
> is with multibyte characters and it almost certainly doesn't handle accented
> characters correctly in languages that have a mix of ascii and non-ascii
> characters, since the non-ascii chars will all sort as greater than the
> ascii chars, which is probably not how things are alphabetized in those
> languages.
>
>
'locale -a' at a command prompt will list all locale's available on a host.
I guess there is no universal utf-8 collation, which isn't that surprising,
given how utf-8 works, though I had hoped that there might be one that at
least attempts to handle all languages that share character codes in some
kind of sensible manner. Perhaps the details of UTF-8 encoding make that
effectively impossible. But it looks as though if you want to handle
accented characters, you would need to set collation to something like
fr_FR.UTF-8, though it looks like all of the UTF-8 collations ignore spaces
and are case-insensitive. I don't know of a way to determine what the
sorting behaviour of an arbitrary collation might be other than
experimentation. You can set LC_ALL to the locale in question and pipe a
test file through the sort command to do a quick test:

LC_ALL=en_US.UTF-8 sort < test.txt

I checked variants like .UTF-8, .iso88591, (dot)iso885915(at)euro and they
all ignore spaces in all of the languages I checked. It appears that
the only way to get a space-sensitive sort order is to use C
collation. Same goes for capitalization, I think. Which is pretty
ugly, if you ask me.

It looks like it is theoretically possible to modify a locale's
collation behaviour. On my linux system, all locale's with the
space-ignoring, capital-ignoring collation have this entry in the
locale definitions in /usr/share/i18n/locales:

LC_COLLATE

% Copy the template from ISO/IEC 14651

copy "iso14651_t1"

END LC_COLLATE

Looking at the iso14651_t1 file, I see this:

# Any character not precisely specified will be considered as a special

# character and considered only at the last level.

# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>

#

# SYMB. N° GLY

#

<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _

<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>

<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)

<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>

<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -

<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,

<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;

<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :

There doesn't seem to be any obvious discussion of how to structure the
LC_COLLATE section of a locale definition, but it is clear that telling it
to ignore spaces and punctuation is the problem we are seeing. That seems
like a very strange choice to make as the default for all UTF-8 locales. I
can't find the locale definitions for OS X, which might reveal a more
sensible LC_COLLATE or may just be defaulting back to C collation for
everything.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Samuel Gendler 2011-06-22 10:39:51 Re: a strange order by behavior
Previous Message Pavel Stehule 2011-06-22 10:01:46 Re: a strange order by behavior