Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Date:	2011-06-09 12:02:03
Message-ID:	BANLkTikbcnFctkUYeh_Pygd538GP6UEGwg@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jun 9, 2011 at 12:39 AM, Jeevan Chalke
<jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>> It's a problem, but without an efficient algorithm for Unicode case
>> folding, any fix we attempt to implement seems like it'll just be
>> moving the problem around.
>
> Agree.
>
> I read on another mail thread that str_tolower() is a wide-character-aware
> lower-casing function, but it is also collation-aware and hence might change
> its behaviour when the locale changes. However, Tom suggested that we need a
> non-locale-dependent case-folding algorithm.
>
> But still, for the same locale on the same machine, we are able to create a
> table and insert some data, yet we cannot retrieve it. Don't you think that is
> more serious and that we need a quick solution here? As said earlier, it may
> even lead to pg_dump failures. str_tolower() is locale-dependent, but it
> would still resolve this particular issue. There might be a performance
> issue, but at least we would not be raising an error.

Well, as I understand it, the problem here is that if someone goes and
changes the locale, then you might massively break the user's
application. For example, if the user says:

CREATE TABLE FOO (...);
SELECT * FROM FOO;

...that'll work, of course, because whatever you get when you downcase
FOO will be the same both times. But if the locale now changes, then
the next...

SELECT * FROM FOO;

...might fail, because the new downcasing of FOO might not match the old one.
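
To make the failure mode concrete, here is a minimal sketch, not PostgreSQL code. The two folding functions are hypothetical stand-ins for "the locale at CREATE time" and "the locale at lookup time" (one folds high-bit letters Latin-1 style, the other folds ASCII only), and lookup_still_matches() is an invented helper that just compares the two folded forms:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char (*fold_fn) (unsigned char);

/* Hypothetical "locale A": folds ASCII plus Latin-1-style high-bit letters. */
static unsigned char
fold_locale_a(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/* Hypothetical "locale B": folds ASCII only, like the C locale. */
static unsigned char
fold_locale_b(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    return ch;
}

/* Fold every byte of ident with the given function; out needs strlen(ident)+1 bytes. */
static void
fold_identifier(const char *ident, fold_fn fold, char *out)
{
    size_t      i;

    assert(ident != NULL && out != NULL);
    for (i = 0; ident[i] != '\0'; i++)
        out[i] = (char) fold((unsigned char) ident[i]);
    out[i] = '\0';
}

/* True if the name stored at CREATE time equals the name folded at lookup time. */
static bool
lookup_still_matches(const char *ident, fold_fn at_create, fold_fn at_lookup)
{
    char        stored[64];
    char        wanted[64];

    if (strlen(ident) >= sizeof(stored))
        return false;
    fold_identifier(ident, at_create, stored);
    fold_identifier(ident, at_lookup, wanted);
    return strcmp(stored, wanted) == 0;
}
```

With these assumed folds, an identifier containing the byte 0xD6 matches as long as the same fold is used both times, and stops matching once the fold used for the lookup differs from the one used when the table was created.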

You could argue that that's better than the current situation, but
it's not clear-cut.

But now that I re-think about it, I guess what I'm confused about is
this code here:

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

It seems to me that we're downcasing the first byte of each wide
character and ignoring the rest... which seems like it can't possibly
be a good idea in a multi-byte encoding. Perhaps we could keep that
approach for single-byte encodings and just pass through multi-byte
characters untouched?
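
A rough sketch of that idea, as an illustration rather than a patch: the single_byte flag, downcase_sketch(), latin1_tolower() and utf8_is_well_formed() are all names made up for this example. latin1_tolower() is a hard-coded stand-in for what a single-byte locale's tolower() might do, and the UTF-8 check only looks at lead and continuation byte structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for tolower() in a Latin-1-style single-byte locale. */
static unsigned char
latin1_tolower(unsigned char ch)
{
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/*
 * Always fold ASCII; fold high-bit bytes only when the encoding is
 * single-byte, otherwise pass them through untouched.
 * result must have room for len + 1 bytes.
 */
static void
downcase_sketch(const char *ident, size_t len, bool single_byte, char *result)
{
    assert(ident != NULL && result != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (single_byte && (ch & 0x80))
            ch = latin1_tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}

/* Structural UTF-8 check: valid lead byte followed by enough continuation bytes. */
static bool
utf8_is_well_formed(const char *s, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        unsigned char c = (unsigned char) s[i];
        size_t      need;

        if (c < 0x80)
            need = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;
        else
            return false;

        if (need > len - i - 1)
            return false;       /* sequence runs past the end */
        for (size_t k = 1; k <= need; k++)
            if (((unsigned char) s[i + k] & 0xC0) != 0x80)
                return false;   /* expected a continuation byte */
        i += need + 1;
    }
    return true;
}
```

Under these assumptions, folding the UTF-8 bytes 'T' 0xC3 0x80 'B' byte by byte turns the lead byte 0xC3 into 0xE3, which announces a three-byte sequence that never arrives, so the result is no longer well-formed. With single_byte set to false the two high-bit bytes are copied as-is and only the ASCII letters are downcased.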

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS at 2011-06-09 04:39:36 from Jeevan Chalke

Responses

Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS at 2011-06-09 14:07:29 from Tom Lane
