Re: a strange order by behavior

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Eyal Wilde <eyal(at)impactsoft(dot)co(dot)il>, pgsql-sql(at)postgresql(dot)org
Subject: Re: a strange order by behavior
Date: 2011-06-22 11:20:50
Message-ID: BANLkTimVfKet6Aixyz5g6xr6oZ1zJhSs5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Wed, Jun 22, 2011 at 3:39 AM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com>wrote:

>
>
> On Wed, Jun 22, 2011 at 3:01 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:
>
>> Hello Peter
>>
>>
>> > Pavel suggested using a collation of ucs_basic, but I get an error when
>> I
>> > try that on linux:
>> > $ createdb -U u1 --lc-collate=ucs_basic -E UTF-8 test
>> > createdb: database creation failed: ERROR: invalid locale name
>> ucs_basic
>>
>> isn't this a bug in collations?
>>
>
> The more I read about this, the more this would appear to be the case. It
> looks like the SQL standard has some baseline collations that are required
> and it isn't at all clear how one would access those in postgres if the host
> in question doesn't have those locale's defined on the host. UCS_BASIC is a
> SQL collation, but doesn't appear to have an explicit definition on a
> 'standard' linux host (CentOS 5, in my case). There is another SQL
> collation called 'UNICODE' which is supposed to obey the Unicode Collation
> Algorithm with the Default Unicode Collation Element Table defined in
> Unicode10. It looks like that collation is relatively sensitive to
> language-specific sort orders, though it isn't a required collation in the
> sql standard. I suspect that it is the UNICODE collation which actually
> would appear to be the most 'sensible' within the context of this discussion
> - characters in expected order, spaces honoured, case sensitive. I have so
> little experience with localization that I'm not sure if I'm reading this
> all correctly, though.
>
>
>

My final bit of research before going to bed -

If you are stuck with UTF-8 collation, you can do C-style collation in order
by clauses like this:

select * from t1 order by f1 using ~<~
select * from t1 order by f1 using ~>~

If you want it to be able to use an index for that (or you want to use LIKE
in your where clause with that column), then you need to have an index which
uses C-style collation. You can create an index on the column using
text_pattern_ops or varchar_pattern_ops and then LIKE, <, and > comparisons
will use that index and do comparisons like C collation, as will order by
clauses with ~>~ and ~<~

create index t1_f1_pattern ON t1 (f1 varchar_pattern_ops);
create index t1_f1 ON t1 (f1);

If you have both of those indexes, postgresql will, apparently, do the
correct thing based on which operations you ask of it in your query. The
pattern_ops thing is mentioned in the docs (but on a page about indexes
rather than on a page about varchar and text types) and I found the ~>~
operator in an ORDER BY clause in an old mailing list email. It may be in
the docs, I'm not sure. Is there a documentation page or wiki page
consisting of 'stuff that may surprise you in postgresql?' Such a page
would be a handy place to point this kind of thing out to new users. Most
users, even relatively experienced ones, probably haven't read the
documentation cover to cover. I certainly hadn't.

9.1 allows collation to be specified on a per-column basis, but since all of
the utf-8 collations ignore punctuation and space characters, you'll still
have to do C-style collation to get alphabetical order that honors
whitespace, punctuation, and case, but then non-ascii characters will be in
the wrong order. So you can either have language sensitive sorting or
punctuation- and case- sensitive sorting, but never both - at least not
without defining a custom locale. Punting this defect to the OS and saying
it has to be fixed there seems like a somewhat weak answer - at least
without offering up a source for alternative locale definitions that do
sorting according to UNICODE collation.

I'm actually surprised that european users aren't complaining about this all
the time, but maybe european users are used to seeing things ordered in a
manner which doesn't honour the 'correct' ordering of accented characters.
Actually, I wonder if the probable explanation of the lack of complaint is
the fact that the kinds of fields most apps would tend to do alphabetical
sorts on probably don't tend to have lots of punctuation other than spaces,
so perhaps the language sensitive sorts are deemed sufficient because most
people don't notice the funky behaviour with punctuation and whitespace
while case-insensitive sort is probably desired most of the time.

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Pavel Stehule 2011-06-22 13:44:00 Re: a strange order by behavior
Previous Message Samuel Gendler 2011-06-22 10:39:51 Re: a strange order by behavior