Re: a strange order by behavior

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Eyal Wilde <eyal(at)impactsoft(dot)co(dot)il>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: a strange order by behavior
Date: 2011-06-22 08:43:20
Message-ID: BANLkTi=PXV+FrAJHHX56D41Xo4DzG7vADA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal(at)impactsoft(dot)co(dot)il> wrote:

> the database collation is: en_US.UTF-8
>
> drop table t1;
> create table t1 (recid int ,f1 varchar(20));
> insert into t1 values (1,'a');
> insert into t1 values (2,' ');
> insert into t1 values (3,'aa');
> insert into t1 values (4,' a');
> select * from t1 order by f1
>
> result:
>
> recid f1
> 2 " "
> 1 "a" -- 2 comes before 1 because space is smaller then 'a'.
> fine.
> 4 " a" -- now is see that 1 comes before 4 because space is
> greater then 'a' !?
> 3 "aa" -- now again, 4 comes before 3 because space is smaller
> the 'a' !?!
>
>
I seem to recall a thread here about it ignoring spaces entirely in that
collation (and maybe ignoring capitalization, too?). I went to go test that
assertion by initializing a database with C collation and got some complaint
about it being incompatible with my template1 template database. I
initialized a db off of template0 and then got side tracked and you've only
just reminded me of it. I was planning to test whether it is safe to use
UTF-8 for encoding but use C collation, and then maybe investigate other
collations.

This worked:

createdb -E UTF-8 --lc-collate=C some_db

so it should be easy enough to play around with it some. I'm not sure how
to get a list of valid collations for any given charset, and it seems like C
collation would generate somewhat strange results with non-ascii characters
(french accented characters are supposed to alphabetize in some unexpected
manner, I believe), so there must be something better - closer to UTF-8
collation but without ignoring whitespace and such. A quick google search
reveals that there is some kind of standard for unicode collation (
http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what is
represented by the en_US.UTF-8 collation or not. I've got no real
experience with this stuff.

It appears that there are differences regarding collation in recent versions
- the beta docs for 9.1 show that you can set collation on individual
operations or differently for different columns (
http://www.postgresql.org/docs/9.1/static/collation.html ). I see nothing
like that in 8.4 docs.

It definitely looks like we both need to have a read of the localization
chapter of the docs for our database version as there is a bunch of stuff in
there that I was surprised to read when I just did a quick scan - like using
anything but C or posix is much slower and can produce incorrect results in
a 'like' query

It looks like the docs prior to 9.1beta have no real reference to collation
at all, so it's down to trial and error unless someone in the know speaks
up.

--sam

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Pavel Stehule 2011-06-22 08:45:44 Re: a strange order by behavior
Previous Message Eyal Wilde 2011-06-21 10:37:38 a strange order by behavior