From: | Gregory Stark <stark(at)enterprisedb(dot)com> |
---|---|
To: | PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | UTF8 on Debian |
Date: | 2007-10-15 21:07:06 |
Message-ID: | 87fy0c2ikl.fsf@oxford.xeocode.com |
Lists: | pgsql-hackers |
Something very strange is going on on my machine with UTF8:
postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)
postgres=# select length(convert_from(E'\343\203\251\343\202\244\343\202\273\343\203\263','utf8'));
length
--------
8
(1 row)
postgres=# select 'substring(s,'||i||',1)',convert_to(substring(s,i,1),'utf8') from (select convert_from(E'\343\203\251\343\202\244\343\202\273\343\203\263','utf8') as s)a, (select generate_series(1,8) as i)b;
?column? | convert_to
------------------+------------
substring(s,1,1) | \343
substring(s,2,1) | \203\251
substring(s,3,1) | \343
substring(s,4,1) | \202\244
substring(s,5,1) | \343
substring(s,6,1) | \202\273
substring(s,7,1) | \343
substring(s,8,1) | \203\263
(8 rows)
I believe this is in fact only four katakana characters (namely U+30E9, U+30A4,
U+30BB and U+30F3). \343 is merely the first byte of the three-byte encoding of
each individual character.
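As a cross-check outside the server, here is a minimal Python sketch that decodes the same octal-escaped bytes. The variable names are only illustrative, and it assumes nothing about how Postgres itself computes lengths; it just shows what a conforming UTF-8 decoder makes of the input.

```python
# The byte string from the queries above, written with the same octal escapes.
raw = b"\343\203\251\343\202\244\343\202\273\343\203\263"

# Decode as UTF-8; this raises UnicodeDecodeError if the bytes are malformed.
text = raw.decode("utf-8")

# Code point of each decoded character, formatted as U+XXXX.
codepoints = [f"U+{ord(ch):04X}" for ch in text]

# Re-encode each character separately to see how many bytes it occupies.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(len(raw), len(text))   # byte count, character count
print(codepoints)
print(bytes_per_char)
```

This gives 12 bytes and 4 characters, each three bytes long, so the length of 8 reported above matches neither the byte count nor the character count.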
Dave doesn't see the same behaviour on his three machines, so I think it's
something unique to my machine. Possibly not a Postgres bug at all but some
kind of install gotcha.
I'm running Debian unstable with glibc 2.6.1-4 so it is a bit bleeding edge.
But as I understand it the utf8 decoding is all our code anyways so I can't
quite figure out how it could be glibc's fault.
Does anybody else see anything like this?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com