problem comparing strings when different cluster / database encoding

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-sql(at)postgresql(dot)org
Subject: problem comparing strings when different cluster / database encoding
Date: 2006-04-05 22:29:09
Message-ID: 443444B5.1010205@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Greetings,

I've encountered a strange problem. We have a PG 8.0.x database cluster
(in the sense used in initdb, i.e. bunch of databases) created with
UNICODE encoding, namely cs_CZ.UTF-8 locale.

When a database is created with a different encoding (in our case it's
LATIN2) the string comparison doesn't work correctly. For example the query

SELECT 'ě' = 'é';

returns 'true' which is obviously incorrect, as those two letters have
different accents (I hope you can see that). And of course, it's not
possible to create an unique index (or primary key) over a column of
words (for example in a dictionnary), because false collisions are
found, and the sorting works in a really strange way too.

If the both cluster and database are in the same encoding (UNICODE or
LATIN2), everything works fine.

Below is a short description how the database cluster and the databases
have been created.

----------------------------------------------------------------------
$ export LANG="cs_CZ.UTF-8"
$ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'true', which is incorrect)
----------------------------------------------------------------------

If we create the cluster with LATIN2 encoding (or on the contrary the
database is created with UNICODE encoding), everything works fine.
For example the following works as expected.

----------------------------------------------------------------------
$ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used)
$ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'false', which is correct)
----------------------------------------------------------------------

I'm trying to solve this for several days, but unsuccesfully. Is there
something I've missed? Some obvious solution I don't see?

The queston is why we need different encodings for cluster / databases.

(a) Until recently we've used LATIN2 cluster and LATIN2 databases (and
applications expecting LATIN2 encoding) - that's the reason why we
need LATIN2 databases.

(c) On the other way some of the new clients want to 'internationalize'
their applications, so we need UNICODE infrastructure too - that's
the reason why we use UNICODE cluster and databases.

I've came accross the nls_string function - with it it works fine, but
that's not an option for us, as it would require rewriting all the SQL
queries in the applications (and that's something we don't want).

Thanks for your suggestions
Tomas

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2006-04-05 22:29:28 Re: SELECT composite type
Previous Message Niklas Johansson 2006-04-05 20:31:16 Re: SELECT composite type