From: | Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> |
---|---|
To: | Andrew Sullivan <ajs(at)crankycanuck(dot)ca> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Concerning about Unicode-aware string handling |
Date: | 2012-05-22 04:31:50 |
Message-ID: | 4FBB16B6.5020103@ringerc.id.au |
Lists: | pgsql-general |

On 05/21/2012 06:59 PM, Andrew Sullivan wrote:
> On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote:
>> support the bastardized UTF-16 'unicode' implemented by Windows NT
> To be fair to Microsoft, while the BOM might be an irritant, they do
> use a perfectly legitimate encoding of Unicode. There is no Unicode
> requirement that code points be stored as UTF-8, and there is a strong
> argument to be made that, for some languages, UTF-8 is extremely
> inefficient and therefore the least preferred encoding. (Microsoft's
> dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of
> course, and appears to be adjusted in funny ways in Win 7.)

In fact, until it became clear that UCS-2 (now UTF-16) wasn't enough and
we'd need 4 bytes to represent characters, Microsoft's choice of UCS-2
with BOM looked really good. They just didn't realise that UCS-2 would
turn into UTF-16 when UCS-4 came on the scene, so they'd be left holding
a bastardised half-way mess that's usually-but-not-always 2 bytes per
character.

MS's choice allowed programs to work with the safe (at the time)
assumption that each char was 2 bytes, which made a lot of things way
simpler than they are in UTF-8 and was well and truly worth the storage
bloat IMO. Pity Unicode had to grow again and break the assumption.
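
As a rough illustration of the "usually-but-not-always 2 bytes" point, here is a minimal Python sketch. It is not from the original thread: the helper name `encoded_sizes` and the four sample characters are my own assumptions, chosen only to show the byte counts. The `-le` codec variants are used so that Python does not prepend a BOM to the output.

```python
# Illustrative sketch only; encoded_sizes is a hypothetical helper name.
def encoded_sizes(ch):
    """Return the number of bytes needed to store ch in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le variant: no BOM prepended
        "utf-32": len(ch.encode("utf-32-le")),  # -le variant: no BOM prepended
    }

# Three characters inside the Basic Multilingual Plane, then one outside it.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The first three characters each take 2 bytes in UTF-16, which is the assumption UCS-2 code could rely on. U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (4 bytes) for it. The U+4E2D row also shows the efficiency argument from the quoted message: 3 bytes in UTF-8 against 2 in UTF-16.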
--
Craig Ringer