From: | Bart Samwel <bart(at)samwel(dot)tk> |
---|---|
To: | Johann Zuschlag <zuschlag2(at)online(dot)de> |
Cc: | Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org |
Subject: | Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text |
Date: | 2006-03-30 21:36:44 |
Message-ID: | 442C4F6C.2000607@samwel.tk |
Lists: | pgsql-odbc |
Johann Zuschlag wrote:
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a "variable-length
> multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
> bytes, every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not equal to UTF-8, even though the
> PostgreSQL documentation is stating that.
Well, it's actually even more complicated, because Unicode is larger
than 16 bits: code points run up to U+10FFFF (21 bits) and are commonly
handled as 32-bit values. There is UTF-8 (variable-length, 8-bit code
units), UTF-16 (variable-length, 16-bit code units) and UTF-32
(fixed-length, one 32-bit unit per character). There is also UCS-2
(fixed-length 16-bit), which is limited to the Basic Multilingual Plane,
and UCS-4, which is functionally identical to UTF-32. UTF-8 uses up to
4 bytes per character, so it is more complete than the purely 16-bit
UCS-2. Any of the variable-length encodings, and the 32-bit UTF-32 and
UCS-4 encodings, can represent the whole of the character set.
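To make the size differences concrete, here is a small Python sketch
(my own illustration, not something the driver does). The sample
characters and the helper name encoded_sizes are just assumptions for
the example. It also shows that non-ASCII characters in UTF-8 are not
always two bytes.

```python
# Encoded size, in bytes, of one character under each Unicode encoding form.
# The sample characters are arbitrary picks for illustration.
samples = {
    "A": "ASCII letter",
    "\u00e4": "German umlaut a",
    "\u20ac": "euro sign",
    "\U0001D11E": "musical G clef, outside the BMP",
}

def encoded_sizes(ch):
    # The "-le" codecs are used so that no byte order mark is prepended.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch, desc in samples.items():
    print(f"U+{ord(ch):04X} ({desc}): {encoded_sizes(ch)}")
```

The umlaut takes 2 bytes in UTF-8, the euro sign 3, and the clef 4; in
UTF-16 the clef needs two 16-bit units, and UTF-32 is always 4 bytes.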
A pure Unicode world can use any of those encodings, so it's a tradeoff.
If you want a direct relationship between the number of characters in a
string and the number of bytes taken, use a fixed-length encoding. If
you want to be able to encode everything, use a variable-length encoding
or a 32-bit encoding. If you want to use little space for text that is
mostly ASCII, use UTF-8. That's it.
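A rough Python sketch of that tradeoff, assuming a short German sample
string (the helper nth_char_utf32 is a made-up name for illustration):
with UTF-32 the byte offset of a character follows directly from its
index, while UTF-8 is smaller but has no such fixed relationship.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    # Fixed width: character n always starts at byte offset 4 * n.
    return data[4 * n : 4 * n + 4].decode("utf-32-le")

text = "Gr\u00fc\u00dfe"            # 5 characters, two of them non-ASCII
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")    # "-le" avoids a byte order mark

print(len(text), len(utf8), len(utf32))   # 5 7 20
print(nth_char_utf32(utf32, 2))           # the u-umlaut, found by offset alone
```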
> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
> languages. Hiroshi can you explain that? I guess the Japanese edition of
> Windows XP is using pure 2 byte Unicode.
In fact, the Win32 API is UTF-16 even in European languages (it started
out as UCS-2 but became UTF-16 when Unicode went 32-bit :-) ), but it
provides an 8-bit compatibility interface. Don't know if the 8-bit
encoding is UTF-8 or plain 8-bit code pages though.
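For what the distinction looks like on the wire, here is a Python
sketch. It makes no claim about what Windows actually does; cp1252 is
only an assumed example of a Western European code page, and
utf16_units is a hypothetical helper name.

```python
import struct

def utf16_units(ch):
    # Split the UTF-16 (little-endian, no BOM) form into 16-bit code units.
    data = ch.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

umlaut = "\u00e4"       # inside the BMP: one 16-bit unit, same as UCS-2
clef = "\U0001D11E"     # outside the BMP: needs a surrogate pair in UTF-16

print([hex(u) for u in utf16_units(umlaut)])   # ['0xe4']
print([hex(u) for u in utf16_units(clef)])     # ['0xd834', '0xdd1e']

# The same umlaut in two candidate 8-bit encodings:
print(umlaut.encode("cp1252"))   # b'\xe4'      (one byte, code page)
print(umlaut.encode("utf-8"))    # b'\xc3\xa4'  (two bytes, UTF-8)
```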
Reference: http://en.wikipedia.org/wiki/Unicode
Cheers,
Bart