From: | "Trevor Talbot" <quension(at)gmail(dot)com> |
---|---|
To: | "Dave Page" <dpage(at)postgresql(dot)org> |
Cc: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Locale + encoding combinations |
Date: | 2007-10-12 13:03:52 |
Message-ID: | 90bce5730710120603t1d10b20ld689ef41b201026b@mail.gmail.com |
Lists: | pgsql-hackers |
On 10/12/07, Dave Page <dpage(at)postgresql(dot)org> wrote:
> Tom Lane wrote
> > That still leaves us with the problem of how to tell whether a locale
> > spec is bad on Windows. Judging by your example, Windows checks whether
> > the code page is present but not whether it is sane for the base locale.
> > What happens when there's a mismatch --- eg, what encoding do system
> > messages come out in?
>
> I'm not sure how to test that specifically, but it seems that accented
> characters simply fall back to their undecorated equivalents if the
> encoding is not appropriate, eg:
>
> Dave(at)SNAKE:~$ ./setlc French_France.1252
> Locale: French_France.1252
> The date is: sam. 01 of août 2007
> Dave(at)SNAKE:~$ ./setlc French_France.28597
> Locale: French_France.28597
> The date is: sam. 01 of aout 2007
>
> (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
>
> I'm happy to test further if you can suggest how I can figure out the
> encoding actually output.
The output encoding is the one you specified. Keep in mind that
Windows is mostly working with Unicode underneath, so all characters
exist and the locale rules specify their behavior there. The encoding
is just the byte stream it has to force them all into after it has
finished processing them. As you've seen, it uses some sort of
best-fit mapping whose details I don't know. (By default, it will
drop accent marks and choose characters with a similar shape where
possible.)
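For illustration only, here is a rough Python sketch of what such a
best-fit conversion could look like. It is not Windows' actual
algorithm, which I haven't examined; the helper name best_fit_encode
and the decompose-and-drop-marks fallback are my own assumptions,
chosen only to reproduce the visible effect in Dave's output.

```python
import unicodedata


def best_fit_encode(text: str, encoding: str) -> bytes:
    """Encode text, approximating characters the target encoding lacks.

    Hypothetical helper: it imitates the visible effect (accents dropped),
    not the real Windows best-fit behavior.
    """
    out = bytearray()
    for ch in text:
        try:
            # Character exists in the target encoding: emit it unchanged.
            out += ch.encode(encoding)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. "û" -> "u" + combining circumflex) and drop the marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        # Anything still unrepresentable becomes "?".
        out += base.encode(encoding, errors="replace")
    return bytes(out)


print(best_fit_encode("août", "cp1252"))     # û exists in Windows-1252
print(best_fit_encode("août", "iso8859_7"))  # û is missing from ISO 8859-7
```

With WIN1252 the û survives as the single byte 0xFB; with ISO8859-7
it has no slot, so the sketch falls back to plain "u", giving the
same "aout" that Dave saw. Dumping the raw bytes like this is also a
simple way to check which encoding a program actually emitted.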
I think it's a bit more complex for input/transform cases, where you
operate on the byte stream directly without an intermediate conversion
to Unicode. That is why UTF-8 doesn't work as a codepage, but again I
don't have the details at hand. I can try to do more digging if
needed.