From: | Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | John Hansen <john(at)geeknet(dot)com(dot)au>, Hackers <pgsql-hackers(at)postgresql(dot)org>, Patches <pgsql-patches(at)postgresql(dot)org> |
Subject: | Re: UNICODE characters above 0x10000 |
Date: | 2004-08-07 06:27:31 |
Message-ID: | Pine.LNX.4.44.0408070820300.9559-100000@zigo.dhs.org |
Lists: | pgsql-hackers pgsql-patches |
On Sat, 7 Aug 2004, Tom Lane wrote:
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says. But I do think you are fixing the code at the wrong level.
I can give some general info about utf-8. This is how it is encoded:
character            encoding
-------------------  ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
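As a rough illustration of the table above, here is a minimal encoder sketch in C. The name utf8_encode and its interface are made up for this example (they are not from the PostgreSQL source), and it follows the original 6-byte scheme shown in the table rather than any tighter limit.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not PostgreSQL code: encode the character value cp
 * into buf (which must have room for 6 bytes), following the table above.
 * Returns the number of bytes written, or 0 if cp does not fit in 31 bits.
 */
static int
utf8_encode(uint32_t cp, unsigned char *buf)
{
    int n;
    int i;

    if (cp <= 0x7F)
    {
        buf[0] = (unsigned char) cp;    /* 0xxxxxxx */
        return 1;
    }
    else if (cp <= 0x7FF)
        n = 2;
    else if (cp <= 0xFFFF)
        n = 3;
    else if (cp <= 0x1FFFFF)
        n = 4;
    else if (cp <= 0x3FFFFFF)
        n = 5;
    else if (cp <= 0x7FFFFFFF)
        n = 6;
    else
        return 0;

    /* Fill the 10xxxxxx bytes from the end, 6 bits at a time. */
    for (i = n - 1; i > 0; i--)
    {
        buf[i] = (unsigned char) (0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* Lead byte: n ones, a zero, then the remaining high bits. */
    buf[0] = (unsigned char) ((0xFF << (8 - n)) | cp);
    return n;
}
```

For example, the value 0x20AC falls in the third row and comes out as the three bytes E2 82 AC.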
If the first byte starts with a 1 then the number of leading ones gives
the length of the utf-8 sequence. The rest of the bytes in the sequence
always start with 10 (this makes it possible to look anywhere in the
string and quickly find the start of a character).
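A small sketch of both points in C, assuming nothing about how pg_utf_mblen is actually written. The names utf8_seq_len and utf8_char_start are invented for this example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: length in bytes of the
 * sequence introduced by lead byte c. Returns 0 if c cannot start a
 * sequence (a 10xxxxxx byte, or a byte with 7 or 8 leading ones).
 */
static int
utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00)
        return 1;               /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((c & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((c & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    if ((c & 0xFC) == 0xF8)
        return 5;               /* 111110xx */
    if ((c & 0xFE) == 0xFC)
        return 6;               /* 1111110x */
    return 0;
}

/*
 * Hypothetical helper: given any byte offset pos into s, step backwards
 * over 10xxxxxx bytes to the offset where that character starts.
 */
static size_t
utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

With the bytes 61 C3 B6 62, asking for the start of the character at offset 2 steps back to offset 1, where the two-byte sequence begins.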
This also means that the start byte can never start with 7 or 8 ones
(bytes FE and FF); that is illegal and should be tested for and rejected.
So the longest utf-8 sequence is 6 bytes (and the largest decoded
character value needs 4 bytes (or 31 bits)).
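To make the rejection and the 31-bit limit concrete, here is a hedged decoder sketch in C. The function utf8_decode is a made-up name for this example. It only checks the lead byte and the 10xxxxxx pattern of the following bytes; it does not reject overlong forms or other values a stricter validator might refuse.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not PostgreSQL code: decode one sequence from s
 * (len bytes available) into *out. Returns the number of bytes consumed,
 * or 0 if the lead byte is illegal, the input is truncated, or a
 * following byte does not start with 10.
 */
static int
utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t value;
    int n;
    int i;

    if (len == 0)
        return 0;

    if (s[0] < 0x80)
    {
        n = 1;
        value = s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        n = 2;
        value = s[0] & 0x1F;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        n = 3;
        value = s[0] & 0x0F;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        n = 4;
        value = s[0] & 0x07;
    }
    else if ((s[0] & 0xFC) == 0xF8)
    {
        n = 5;
        value = s[0] & 0x03;
    }
    else if ((s[0] & 0xFE) == 0xFC)
    {
        n = 6;
        value = s[0] & 0x01;
    }
    else
        return 0;               /* 10xxxxxx, or 7 or 8 leading ones */

    if ((size_t) n > len)
        return 0;               /* truncated sequence */

    for (i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a 10xxxxxx byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    *out = value;
    return n;
}
```

The 6-byte sequence FD BF BF BF BF BF decodes to 0x7FFFFFFF, which is the 31-bit maximum, so a 32-bit integer is enough to hold any decoded value.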
--
/Dennis Björklund