From: | Julian Satchell <j(dot)satchell(at)eris(dot)qinetiq(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | lower and upper not UTF-8 safe |
Date: | 2003-08-04 13:43:58 |
Message-ID: | 1060004637.28875.3215.camel@jsatchell.eris.qinetiq.com |
Lists: | pgsql-hackers |

The implementations of lower and upper in
src/backend/utils/adt/oracle_compat.c use the single-byte character
routines from ctype.h to alter individual bytes in the text string.
If the text is UTF-8 encoded, this is simply wrong: the bytes of a
multibyte character get changed independently of each other, and the
result can be an invalid string that is no longer UTF-8.
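
A minimal sketch of the failure, not the actual oracle_compat.c code. What tolower() does to bytes above 0x7F depends on the locale, so the sketch assumes an ISO-8859-1 locale and models it with a hypothetical helper, latin1_tolower(). The names bytewise_lower() and utf8_is_well_formed() are likewise made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for tolower() under an ISO-8859-1 locale. */
static unsigned char
latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char) (c + 0x20);
    /* 0xC0..0xDE are uppercase letters in Latin-1, except 0xD7 (the times sign) */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char) (c + 0x20);
    return c;
}

/* Lowercase a buffer one byte at a time, as a ctype.h-based loop would. */
static void
bytewise_lower(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = latin1_tolower(s[i]);
}

/*
 * Structural UTF-8 check: every lead byte must be followed by the right
 * number of continuation bytes. Overlong forms and surrogates are not
 * rejected; this is only enough to show the breakage.
 */
static bool
utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i++];
        size_t need;

        if (c < 0x80)
            need = 0;
        else if ((c & 0xE0) == 0xC0)
            need = 1;
        else if ((c & 0xF0) == 0xE0)
            need = 2;
        else if ((c & 0xF8) == 0xF0)
            need = 3;
        else
            return false;       /* stray continuation byte or invalid lead */

        for (; need > 0; need--)
        {
            if (i >= len || (s[i] & 0xC0) != 0x80)
                return false;   /* sequence is truncated or malformed */
            i++;
        }
    }
    return true;
}
```

Under that assumption, the UTF-8 encoding of U+00C9 (bytes 0xC3 0x89) becomes 0xE3 0x89 after bytewise_lower(): 0xC3 is treated as a Latin-1 capital letter and shifted to 0xE3, which is the lead byte of a three-byte sequence, but only one continuation byte follows.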
The code is basically unchanged in both 7.3.4 and CVS tip.

I can see two options: remove access to these functions if the database
is running UNICODE, or rewrite/extend them so that the correct thing happens.
The easiest way to do this is probably to convert the UTF-8 to a
fixed-width encoding (say UCS-4), perform the lower operation to get a new
set of character indices, then convert back to UTF-8. The byte length of
the output might even be different from the input, although I don't know
of an example where this happens.
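
The decode, map, re-encode approach could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: all function names are hypothetical, the input is assumed to be well-formed UTF-8, and ucs4_lower() is a toy table covering only ASCII, the Latin-1 Supplement capitals, and U+212A KELVIN SIGN. A real implementation would need a complete case-mapping table. The sketch converts one character at a time rather than building a whole UCS-4 array first.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one well-formed UTF-8 sequence at s into *cp; return bytes consumed. */
static size_t
utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)
    {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12) |
              ((uint32_t) (s[1] & 0x3F) << 6) | (uint32_t) (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t) (s[0] & 0x07) << 18) | ((uint32_t) (s[1] & 0x3F) << 12) |
          ((uint32_t) (s[2] & 0x3F) << 6) | (uint32_t) (s[3] & 0x3F);
    return 4;
}

/* Encode code point cp as UTF-8 into out; return bytes written (1 to 4). */
static size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Toy lowercase mapping on code points; NOT a complete Unicode table. */
static uint32_t
ucs4_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;       /* Latin-1 Supplement capitals */
    if (cp == 0x212A)
        return 'k';             /* KELVIN SIGN lowercases to ASCII k */
    return cp;
}

/*
 * Lowercase in[0..len) into out and return the output length in bytes.
 * With this toy table the output is never longer than the input, so
 * out needs len bytes; a full table may need more room.
 */
static size_t
utf8_lower(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < len)
    {
        uint32_t cp;

        i += utf8_decode(in + i, &cp);
        o += utf8_encode(ucs4_lower(cp), out + o);
    }
    return o;
}
```

The KELVIN SIGN entry gives one case where the byte length does change: U+212A takes three bytes in UTF-8 (0xE2 0x84 0xAA), while its lowercase form, the ASCII letter k, takes one. The output buffer therefore cannot be assumed to be the same size as the input.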

At the very least, the documentation for lower and upper in the manual
should warn the user not to use them in a UNICODE database.
--
Julian Satchell <j(dot)satchell(at)eris(dot)qinetiq(dot)com>
QinetiQ