| From: | Peter Eisentraut <peter_e(at)gmx(dot)net> | 
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org | 
| Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "- -" <crossroads0000(at)googlemail(dot)com> | 
| Subject: | Re: Unicode support | 
| Date: | 2009-04-14 12:32:44 | 
| Message-ID: | 200904141532.44618.peter_e@gmx.net | 
| Lists: | pgsql-hackers | 
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
> Umm, but isn't that because your encoding is using one code point?
>
> See the OP's explanation w.r.t. canonical equivalence.
>
> This isn't about the number of bytes, but about whether or not we should
> count characters encoded as two or more combined code points as a single
> char or not.
Here is a test case that shows the problem. It assumes your terminal can
display combining characters; xterm appears to work:
SELECT U&'\00E9', char_length(U&'\00E9');
 ?column? | char_length
----------+-------------
 é        |           1
(1 row)
SELECT U&'\0065\0301', char_length(U&'\0065\0301');
 ?column? | char_length 
----------+-------------
 é        |           2
(1 row)
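
The same distinction can be reproduced outside the database. The following is only a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper names code_point_count and nfc_count are made up for illustration and have nothing to do with how char_length is implemented. It counts code points in both spellings of é, then counts again after NFC normalization, which composes the canonically equivalent sequence into the single code point:

```python
import unicodedata

# Two canonically equivalent spellings of the same visible character.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_point_count(s: str) -> int:
    """Count code points (hypothetical helper; Python's len() on str does this)."""
    return len(s)


def nfc_count(s: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", s))


print(code_point_count(precomposed), code_point_count(decomposed))
print(nfc_count(precomposed), nfc_count(decomposed))

# The raw strings compare unequal; their NFC forms compare equal.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Note that NFC only helps when a precomposed code point exists. A base letter with a combining mark that has no precomposed form still counts as two code points after normalization, so this is not a general way to count user-perceived characters.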