BUG #8105: names are transformed to lowercase incorrectly

From: pg(at)kolesar(dot)hu
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #8105: names are transformed to lowercase incorrectly
Date: 2013-04-22 14:12:41
Message-ID: E1UUHU1-0000iG-BT@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 8105
Logged by: András Kolesár
Email address: pg(at)kolesar(dot)hu
PostgreSQL version: 9.1.5
Operating system: Windows
Description:

If I specify a Unicode field name without quotes, the field name gets
lowercased incorrectly. pgAdmin 1.14.2 on Linux, PostgreSQL server 9.1.5 on
Windows:

SELECT érték FROM (SELECT 1 AS "érték") AS x;

********** Error **********
SQL state: 42703
Character: 8

In the example above I define a Unicode column name ("érték" means "value"
in Hungarian), then I try to read it. If I use double quotes in the outer
query, it works.

However, the above example works fine if the server runs on Linux:

"PostgreSQL 9.1.9 on i686-pc-linux-gnu, compiled by gcc (Ubuntu/Linaro
4.7.2-2ubuntu1) 4.7.2, 32-bit"

I see the same problem from a PHP client, which gives a more verbose error
message:

ERROR: column "�rt�k" does not exist
LINE 1: SELECT érték FROM (SELECT 1 AS "érték") AS x
               ^

The "é" character is represented incorrectly in the error message, which
shows where the problem is. This character (U+00E9) is encoded in UTF-8 as
C3 A9. In the error message it has become an invalid UTF-8 sequence: E3 A9.
I think Windows uses the Windows-1250 or Windows-1252 character set, where
C3 is an uppercase letter that lowercases to E3. A9 survives tolower()
because it means © (copyright sign) in these charsets, which has no
lowercase pair.
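
The byte arithmetic above can be sketched as follows. This is only an
illustration, not PostgreSQL code: cp1252_tolower_byte() and downcase_bytes()
are hypothetical names, and the mapping is a simplified stand-in for what
tolower() is assumed to do under a Windows-1250/1252 locale (it models only
the 0xC0..0xDE range and ignores the few letters in 0x80..0x9F).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for tolower() under a Windows-1250/1252 locale.
 * In both code pages the bytes 0xC0..0xDE (except 0xD7, the multiplication
 * sign) are uppercase letters whose lowercase forms sit 0x20 higher.
 * Every other non-ASCII byte is returned unchanged.
 */
static unsigned char
cp1252_tolower_byte(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/* Lowercase a buffer one byte at a time, ignoring multibyte boundaries. */
static void
downcase_bytes(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = cp1252_tolower_byte(buf[i]);
}
```

Feeding the two UTF-8 bytes of "é" (C3 A9) through downcase_bytes() changes
the lead byte to E3 and leaves A9 alone, which gives the invalid sequence
E3 A9 seen in the error message.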

I have located the problem in the PostgreSQL source:
src/backend/parser/scansup.c:128

char *
downcase_truncate_identifier(const char *ident, int len, bool warn) {
    // ...
    for (i = 0; i < len; i++)
        // ...
        if (IS_HIGHBIT_SET(ch) && isupper(ch))
            ch = tolower(ch);

This function walks through identifiers byte by byte and lowercases each
byte as if it were an individual character. This is incorrect in multibyte
character sets. It works on Linux with UTF-8 system encoding because
isupper() returns false for both C3 and A9.

The same issue is reported below:

Database object names and libpq in UTF-8 locale on Windows
http://permalink.gmane.org/gmane.comp.db.postgresql.sql/29464

Solution 1: apply tolower() only to A-Z.
Solution 2: use a lowercasing function that takes client_encoding into
account.
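
A minimal sketch of Solution 1, assuming identifiers arrive as plain byte
buffers. downcase_ascii_only() is a hypothetical helper written for this
illustration; it is not the PostgreSQL function and it omits truncation and
memory allocation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fold only ASCII A-Z to lowercase and copy every
 * other byte through untouched, so multibyte sequences such as C3 A9
 * cannot be corrupted regardless of the locale in effect.
 */
static void
downcase_ascii_only(char *ident, size_t len)
{
    size_t i;

    assert(ident != NULL);
    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch = (unsigned char) (ch + ('a' - 'A'));
        ident[i] = (char) ch;
    }
}
```

With this approach the bytes of "éRTéK" become those of "érték". The
trade-off is that non-ASCII uppercase letters are never folded, so ÉRTÉK and
érték would remain distinct unquoted identifiers.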
