Re: String comparison problem in select - too many results

From: Francisco Olarte <folarte(at)peoplecall(dot)com>
To: Durumdara <durumdara(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: String comparison problem in select - too many results
Date: 2018-01-12 10:08:39
Message-ID: CA+bJJbx-GWK5GPF3B8G+LQy=rf_MWpdLTV35BP8of7j54p6pmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Durumdara:

On Thu, Jan 11, 2018 at 1:23 PM, Durumdara <durumdara(at)gmail(dot)com> wrote:
> Dear Francesco!

FrancIsco, with an I, Spanish, not Italian.

> My "bug" is that I commonly used Windows environment where the default
> collation is ANSI, based on Windows language which is Hungarian here
> (Windows1250).

I'm not sure ANSI is a collation in windows. You should read a bit
about locales, as they are not the same as collations. In fact I
suspect the order issues you've got are not UTF8 derived, as all your
data is ASCII(7bits), which represents the same in ANSI and UTF8 or
ISO-8859-whatever ( IIRC win1250 is a bastard superset of ISO8859-1 )

> But because of special characters we used UTF8 to store data in database.
> I supposed that UTF8.hu_HU is working like local natural order here, and the
> common ASCII chars are (like '/') in same position.

You must define "local natural order". But your problem is in the
locale, not in the encoding. If you managed to use win1250.hu_HU your
sorting will be the same, what you need is to use UTF8.C

> Python/Delphi/LibreOffice can sort these numbers correctly (based on local
> ANSI sort).

ANSI does not define sort order. And those three use, IIRC, C-locale
like sort. You are incorrectly assuming this is the correct one.

> I supposed that UTF8.hu_HU is using mostly same order which is valid here
> and it contains all ASCII + ANSI characters we are using here in daily work,
> and they are in very similar order.
> I never thought that it can't handle normal characters in 7 bit range...

It can. Your assumptions are BAD. You are not used to working with
collation-aware systems like postgres, and you assume they must use
the same as non-locale-awaer programs by default. This is normally
never true.

Also, windows is notoriously dificult to configure for locales. IIRC (
haven't used it in 15 years ) you had to go to keyboard preferences to
change it.

> For these numbers I can use C collation, it's ok.

C collation is like sorting raw bytes, it doesn't event sort
upper/lower case correctly ( Do not know how you do it in HU, but in
ES we sort aA before bB, while C locale normally sorts AB..ab.. It's
what non-locale aware programs use because it's dirt cheap to
implement, just sort the data as unsigned byte arrays
lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that
this works too ( You can sort by raw bytes and you get the data sorted
lexicographically sorted by code points ).

Another thing, if you have a column which you want sorted in C locale,
it can be defined as such so it does it by default ( but if you do it
with an alter column or similar you'll need a reindex probably ).

As a final note, please, DO NOT TOP POST and TRIM YOUR QUOTES.
Specially I do not need my signature quoted.

Francisco Olarte.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Viktor Fougstedt 2018-01-12 10:27:49 Incredibly slow queries on information_schema.constraint_column_usage?
Previous Message Raghavendra Rao J S V 2018-01-12 07:56:23 pg_basebackup is taking more time than expected