From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: Windows and locales and UTF-8 (oh my) |
Date: | 2007-10-15 11:26:00 |
Message-ID: | 20071015112600.GB5806@svr2.hagander.net |
Lists: | pgsql-hackers |
On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > I am thinking that Dave's discovery explains some previously unsolved
> > bug reports, such as
> > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > If Windows returns LC_CTYPE=C in a situation like this, then
> > the various single-byte-charset optimization paths that are enabled by
> > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > upper()/lower() and other places. ISTM we had better hack
> > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > is UTF-8 then it returns FALSE regardless of what setlocale says.
>
> Yes, I think we need a change to that routine.
>
> But what about the case where we actually *have* locale=C and
> encoding=UTF8? We need to handle that one somehow. Perhaps we should look
> at LC_COLLATE instead (again, on Windows only. Possibly even only in the
> windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
Hmm. Looking more at that, might there be another problem? Looking at
WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
would then be "C" - even if the database isn't in C.
But I don't really know when that code is called, or if I'm just looking at
things wrong. Just starting up and shutting down the database leaves it at
Swedish_Sweden.1252, not C.
(1252 is still the wrong encoding specifier, but it'll work anyway since we
convert to UTF16)
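For anyone wanting to check what setlocale(LC_CTYPE) reports in a given process, a minimal sketch follows. The helper name copy_current_ctype() is made up for illustration and is not a PostgreSQL function; it only relies on the standard C rule that passing NULL as the locale queries the current setting without changing it.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the PostgreSQL tree): copy the name of the
 * current LC_CTYPE locale into out. Returns false if the query fails or the
 * name does not fit.
 */
static bool
copy_current_ctype(char *out, size_t outlen)
{
    const char *cur;
    int         n;

    assert(out != NULL);

    /* A NULL locale argument only queries; it does not modify the locale. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return false;

    /* Copy right away: a later setlocale() call may overwrite the string. */
    n = snprintf(out, outlen, "%s", cur);
    return n >= 0 && (size_t) n < outlen;
}
```

A process that has not called setlocale() yet starts in the "C" locale, so the value seen depends on where in startup the query happens.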
Now, I came across this trying to find a way for lc_ctype_is_c() to
determine if the database is in C locale or not, *without* resorting to
setlocale(). Any pointers on how to do that properly?
Also, any pointers on a way to check for the kind of failure that's to be
expected from this one returning the wrong thing?
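To make the proposal quoted above concrete, here is a rough sketch of the Windows-only override Tom describes. It is not the real lc_ctype_is_c(): the function name, the explicit parameters, and the check against "C" and "POSIX" are assumptions chosen so the logic can be read and tested in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for lc_ctype_is_c(), for illustration only.
 *   ctype_name       - what setlocale(LC_CTYPE, NULL) reported
 *   encoding_is_utf8 - true if the database encoding is UTF-8
 *   on_windows       - true when running on Windows
 */
static bool
ctype_is_c_sketch(const char *ctype_name, bool encoding_is_utf8,
                  bool on_windows)
{
    assert(ctype_name != NULL);

    /* Proposed override: on Windows with UTF-8, ignore what setlocale says. */
    if (on_windows && encoding_is_utf8)
        return false;

    /* Otherwise trust the reported name (assumed check, see lead-in). */
    return strcmp(ctype_name, "C") == 0 || strcmp(ctype_name, "POSIX") == 0;
}
```

Note that this sketch also returns false for a database that really is locale=C with encoding=UTF8, which is exactly the case that still needs a way to be told apart.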
> > One bright spot is that this does seem to suggest a way to implement the
> > recommendation I made in the -patches thread: if we can't support the
> > encoding (codepage) used by the locale seen by initdb, we could try
> > stripping the codepage indicator (if any) and plastering on .65001
> > to get a UTF8-compatible locale name. That'd only work on Windows
> > but that seems the platform where we're most likely to see unsupportable
> > default encodings.
>
> Um, yes, that should work - assuming encoding is set to UTF8. We can't do
> that for any other encoding, of course.
Looking at that, we don't actually need to put that at the end of the
locale name - all locale names will work with UTF8, even one specifying
1252.
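For reference, the name rewriting suggested in the quoted text (strip the codepage indicator, if any, and append .65001) would be plain string handling along these lines. This is only a sketch: utf8_locale_name_sketch() is a made-up name, it is not what the attached patch does, and it does not check whether setlocale() would accept the resulting name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<language_country>.65001" from a Windows
 * locale name, dropping any existing ".codepage" suffix.
 * Returns false if the result does not fit in out.
 */
static bool
utf8_locale_name_sketch(const char *name, char *out, size_t outlen)
{
    const char *dot;
    size_t      baselen;
    int         n;

    assert(name != NULL && out != NULL);

    /* Everything after the last '.' is treated as the codepage indicator. */
    dot = strrchr(name, '.');
    baselen = dot ? (size_t) (dot - name) : strlen(name);

    n = snprintf(out, outlen, "%.*s.65001", (int) baselen, name);
    return n >= 0 && (size_t) n < outlen;
}
```

With this, both "Swedish_Sweden.1252" and a bare "Swedish_Sweden" come out as "Swedish_Sweden.65001".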
Attached patch seems to work for me for that part. Still doesn't touch
lc_ctype_is_c().
//Magnus
Attachment | Content-Type | Size |
---|---|---|
win32_utf8.patch | text/plain | 2.9 KB |