Quick Links

Re: [HACKERS] Invalid unicode in COPY problem

From:	"John Hansen" <john(at)geeknet(dot)com(dot)au>
To:	"Tatsuo Ishii" <t-ishii(at)sra(dot)co(dot)jp>
Cc:	<linux(at)alteeve(dot)com>, <pgsql-general(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Invalid unicode in COPY problem
Date:	2005-05-08 08:55:21
Message-ID:	5066E5A966339E42AA04BA10BA706AE50A9310@rodrick.geeknet.com.au
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-hackers

Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: linux(at)alteeve(dot)com; pgsql-general(at)postgresql(dot)org;
> pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
>
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: linux(at)alteeve(dot)com
> > > Cc: pgsql-general(at)postgresql(dot)org; pgsql-hackers(at)postgresql(dot)org
> > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > >
> > > We have developed patches which relaxes the character
> validation so
> > > that PostgreSQL accepts invalid characters. It works like this:
> >
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data,
> then at least
> > encode it using a valid encoding.
> >
> > In fact, I've proposed strengthening the validation
> routines for UTF-8.
>
> Actually I myself thought as you are before. Later I found
> that it was not so good idea. People already have invalid
> encoded data in their precious database and have very hard
> time to migrate to newer version of PostgreSQL because of
> encoding validation.
>
> Think about this kind of situation:
>
> There is a table t1(member_id integer primary key,
> member_name text, address text, phone text, email text). I
> have to reach each member by either adress, phone or email.
> Unfortunately some of address field have wrong encoded data.
> In this case I will use phone or email to reach them.
>
> Now I need to upgrade to newer PostgreSQL within 1 day. I
> know I have to fix wrong encoded field but it will take more
> than 1 day. So I would like to import the data first then fix
> wrong encoded field on running database since I can reach
> members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

> I saw this kind of situation in the real world and that's why
> we developed the patches.
> --
> Tatsuo Ishii
>
>

Browse pgsql-general by date

	From	Date	Subject
Next Message	Karsten Hilbert	2005-05-08 09:15:43	Re: [HACKERS] Invalid unicode in COPY problem
Previous Message	Tino Wildenhain	2005-05-08 07:18:29	Re: [HACKERS] Invalid unicode in COPY problem

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Karsten Hilbert	2005-05-08 09:15:43	Re: [HACKERS] Invalid unicode in COPY problem
Previous Message	John Hansen	2005-05-08 08:47:25	Re: Patch for collation using ICU