From: | David Gagnon <dgagnon(at)siunik(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion |
Date: | 2005-04-06 22:12:06 |
Message-ID: | 42545EB6.6070304@siunik.com |
Lists: | pgsql-general |
Hi all,
I ran into this problem and want to share it and get confirmation.
I tried to use the COPY command to bulk-load data. I built a UNICODE
file myself from an MSSQL db, but I can't load it into PostgreSQL. I
always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"
The problem is that both files look exactly the same... I found that
pg_dump in fact creates a UTF-8 file (confirm please), which is UNICODE
but with a variable-length encoding (i.e. some characters use 8 bits and
others 16 bits or more). See for details:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
is a true 16-bit UNICODE (UCS-2) file (confirm please).
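To illustrate the variable-length point, here is a minimal sketch in Python 3 (not something from the original test; the helper name `encoded_sizes` is made up for the example). It compares how many bytes one ASCII character and 'ÿ' (U+00FF) take in UTF-8 and in 16-bit little-endian UNICODE:

```python
def encoded_sizes(text):
    """Return (character, utf-8 byte count, utf-16-le byte count) per character."""
    return [(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
            for ch in text]

# 'N' is plain ASCII; '\u00ff' is the 'ÿ' separator used in the data.
for ch, n8, n16 in encoded_sizes("N\u00ff"):
    print(f"U+{ord(ch):04X}  utf-8: {n8} byte(s)  utf-16-le: {n16} byte(s)")
# U+004E  utf-8: 1 byte(s)  utf-16-le: 2 byte(s)
# U+00FF  utf-8: 2 byte(s)  utf-16-le: 2 byte(s)
```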
So here is the content of the two files:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
UNICODE (crafted from mssql):
1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
HEX representation UTF-8 (Postgresql dump):
00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 1.1.1.AC.COLUMNÃ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf ¿ACNUMÿACDESCÿ
00000020:41 43 44 45 4c 50 41 49 c3 bf ACDELPAIÿ
HEX representation UNICODE (crafted from mssql):
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 ÿþ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 ÿ.A.C.N.U.M.ÿ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 C.D.E.S.C.ÿ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 D.E.L.P.A.I.ÿ.
So PostgreSQL chokes on the FF FE that starts the UNICODE document. Is
it normal for a UNICODE file to start with FF FE?! Note that I tried to
delete those characters, but they aren't visible...
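For what it is worth, FF FE is the byte order mark (BOM) that little-endian UTF-16 files commonly start with, which is why it shows up as "ÿþ" when the bytes are read as single-byte characters. A rough Python 3 sketch for checking what a file starts with (the function name `detect_bom` is only an illustration):

```python
import codecs

def detect_bom(data: bytes) -> str:
    """Guess a codec name from a leading byte order mark, or return '' if none."""
    # The 4-byte UTF-32 marks are tested first because the UTF-32 LE mark
    # (FF FE 00 00) begins with the same two bytes as the UTF-16 LE mark.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return ""

# First bytes of the MSSQL export from the hex dump.
head = bytes.fromhex("fffe3100090031000900")
print(detect_bom(head))  # utf-16-le
```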
So am I right? Is PostgreSQL using UTF-8, and does it not really
understand UNICODE (UCS-2) files? Is there a way to make the COPY
command use a UNICODE UCS-2 encoding?
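One possible workaround, sketched here under the assumption that Python 3 is available and that the export is UTF-16 with a BOM as in the hex dump, is to re-encode the file as UTF-8 before handing it to COPY. The function name and the file names in the usage line are placeholders:

```python
def utf16_to_utf8(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-16 text file (with BOM) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it
    # from the decoded text. newline="" keeps line endings unchanged.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

It would be called as `utf16_to_utf8("vdNotOk.backup", "vdConverted.backup")`. For the sample row above, the 78-byte input should come out as the same 42 bytes shown in the UTF-8 hex dump.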
Thanks for your help
/David
Attachment | Content-Type | Size |
---|---|---|
vdOk.backup | text/plain | 43 bytes |
vdNotOk.backup | text/plain | 78 bytes |