Invalid unicode in COPY problem

From: Madison Kelly <linux(at)alteeve(dot)com>
To: PgSQL General List <pgsql-general(at)postgresql(dot)org>
Subject: Invalid unicode in COPY problem
Date: 2005-05-08 00:22:53
Message-ID: 427D5BDD.3000606@alteeve.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Hi all,

I've been chasing down a bug and from what I have learned it may be
because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles
invalid unicode. I've been given some ideas on how to try to catch
invalid unicode but it seems expensive so I am hoping there is a
postgresql way to deal with this problem.

I've run into a problem where a bulk postgres "COPY..." statement is
dieing because one of the lines contains a file name with an invalid
unicode character. In nautilus this file has '(invalid encoding)' and
the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column
file_name: "Femme Fatal\uffff.url"'.

To actually look at the file from the shell (bash) shows what appears
to be a whitespace but when I copy/paste the file name I get the
'\uffff' you see above.

I could, with the help of the TLUG people, use regex to match for an
invalid character and skip the file but that is not ideal. The reason is
that this is for my backup program and invalid unicode or not, the
contents of the file may still be important and I would prefer to have
it in the database so that it is later copied. I can copy and move the
file in the shell so the file isn't apparently in an of itself corrupt.

So then, is there a way I can tell postresql to accept the invalid
unicode name? Here is a copy of my schema:

tle-bu=> \d file_info_2
Table "public.file_info_2"
Column | Type | Modifiers
----------------------+----------------------+-----------------------------------------
file_group_name | text |
file_group_uid | bigint | not null
file_mod_time | bigint | not null
file_name | text | not null
file_parent_dir | text | not null
file_perm | text | not null
file_size | bigint | not null
file_type | character varying(2) | not null default
'f'::character varying
file_user_name | text |
file_user_uid | bigint | not null
file_backup | boolean | not null default true
file_display | boolean | not null default false
file_restore_display | boolean | not null default false
file_restore | boolean | not null default false
Indexes:
"file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)

'file_name' and 'file_parent_dir' are the columns that could have
entries with the invalid unicode characters. Maybe I could/should use
something other than 'text'? These columns could contain anything that a
file or directory name could be.

Thanks!

Madison

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tatsuo Ishii 2005-05-08 02:01:10 Re: Invalid unicode in COPY problem
Previous Message Tony Caduto 2005-05-07 21:27:13 You can do Postgresql DB GUI programming with Lazarus (free delphi clone)

Browse pgsql-hackers by date

  From Date Subject
Next Message Mike Mascari 2005-05-08 01:59:39 Re: pl/pgsql enabled by default
Previous Message Tatsuo Ishii 2005-05-08 00:08:45 Re: Patch for collation using ICU