From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Martin Schäfer <Martin(dot)Schaefer(at)cadcorp(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: UTF-8 encoding problem w/ libpq |
Date: | 2013-06-03 18:41:37 |
Message-ID: | 51ACE361.2050006@vmware.com |
Lists: | pgsql-hackers |
On 03.06.2013 21:28, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
>> backend is supposed to leave bytes with the high-bit set alone, ie. in
>> UTF-8 encoding, it's supposed to leave ä and ß alone.
>
> Well, actually, downcase_truncate_identifier() is doing this:
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>         ch = tolower(ch);
>
> There's basically no way that that second case can give pleasant results
> in a multibyte encoding, other than by not doing anything.
Hmph, I see.
> I suspect
> that Windows' libc has fewer defenses than other implementations and
> performs some transformation that we don't get elsewhere. This may also
> explain the gripe yesterday in -general about funny results in OS X.
Can't really blame Windows for that. On Windows, we don't require that
the encoding and LC_CTYPE's charset match. The OP used UTF-8 encoding in
the server, but LC_CTYPE="English_United Kingdom.1252", ie. LC_CTYPE
implies WIN1252 encoding. We allow that and it generally works on
Windows because in varstr_cmp, we use MultiByteToWideChar() followed by
wcscoll_l(), which doesn't care about the charset implied by LC_CTYPE.
But for isupper(), it matters.
> We talked about this before and went off into the weeds about whether
> it was sensible to try to use towlower() and whether that wouldn't
> create undesirably platform-sensitive results. I wonder though if we
> couldn't just fix this code to not do anything to high-bit-set bytes
> in multibyte encodings.
Yeah, we should do that. It makes no sense to call isupper or tolower on
bytes belonging to multi-byte characters.
- Heikki
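
A minimal sketch of the fix discussed above: leave high-bit-set bytes alone unless the encoding is single-byte. This is an illustration, not the committed patch. The function name `downcase_ident_sketch` and the `enc_is_single_byte` flag are hypothetical; the flag stands in for however the backend decides that the database encoding is single-byte, and `IS_HIGHBIT_SET` is redefined locally so the example compiles on its own.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the backend macro of the same name. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical helper: downcase an identifier in place.
 *
 * ASCII letters are always folded. Bytes with the high bit set are passed
 * to isupper()/tolower() only when the encoding is single-byte; in a
 * multibyte encoding such as UTF-8 they are part of a longer character
 * and are left untouched.
 */
static void
downcase_ident_sketch(char *ident, size_t len, bool enc_is_single_byte)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
            ch = tolower(ch);

        ident[i] = (char) ch;
    }
}
```

With `enc_is_single_byte` set to false, the two bytes of a UTF-8 "Ä" (0xC3 0x84) pass through unchanged regardless of LC_CTYPE. Without the guard, a WIN1252 LC_CTYPE would see the lead byte 0xC3 as an uppercase "Ã" and turn it into 0xE3, which corrupts the sequence.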