From: | "John Hansen" <john(at)geeknet(dot)com(dot)au> |
---|---|
To: | "Tatsuo Ishii" <t-ishii(at)sra(dot)co(dot)jp> |
Cc: | <linux(at)alteeve(dot)com>, <pgsql-general(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] Invalid unicode in COPY problem |
Date: | 2005-05-08 08:55:21 |
Message-ID: | 5066E5A966339E42AA04BA10BA706AE50A9310@rodrick.geeknet.com.au |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: linux(at)alteeve(dot)com; pgsql-general(at)postgresql(dot)org;
> pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
>
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: linux(at)alteeve(dot)com
> > > Cc: pgsql-general(at)postgresql(dot)org; pgsql-hackers(at)postgresql(dot)org
> > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > >
> > > We have developed patches which relaxes the character
> validation so
> > > that PostgreSQL accepts invalid characters. It works like this:
> >
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data,
> then at least
> > encode it using a valid encoding.
> >
> > In fact, I've proposed strengthening the validation
> routines for UTF-8.
>
> Actually I myself thought as you are before. Later I found
> that it was not so good idea. People already have invalid
> encoded data in their precious database and have very hard
> time to migrate to newer version of PostgreSQL because of
> encoding validation.
>
> Think about this kind of situation:
>
> There is a table t1(member_id integer primary key,
> member_name text, address text, phone text, email text). I
> have to reach each member by either adress, phone or email.
> Unfortunately some of address field have wrong encoded data.
> In this case I will use phone or email to reach them.
>
> Now I need to upgrade to newer PostgreSQL within 1 day. I
> know I have to fix wrong encoded field but it will take more
> than 1 day. So I would like to import the data first then fix
> wrong encoded field on running database since I can reach
> members by phone or email even with wrong encoded address field...
Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);
Now sit back and relax while your invalid data is converted to utf-8.
When done, pg_dump the database, upgrade, and reload.
This should take less than a day.
> I saw this kind of situation in the real world and that's why
> we developed the patches.
> --
> Tatsuo Ishii
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Karsten Hilbert | 2005-05-08 09:15:43 | Re: [HACKERS] Invalid unicode in COPY problem |
Previous Message | Tino Wildenhain | 2005-05-08 07:18:29 | Re: [HACKERS] Invalid unicode in COPY problem |
From | Date | Subject | |
---|---|---|---|
Next Message | Karsten Hilbert | 2005-05-08 09:15:43 | Re: [HACKERS] Invalid unicode in COPY problem |
Previous Message | John Hansen | 2005-05-08 08:47:25 | Re: Patch for collation using ICU |