Re: Invalid unicode in COPY problem

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: linux(at)alteeve(dot)com
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Invalid unicode in COPY problem
Date: 2005-05-08 02:01:10
Message-ID: 20050508.110110.59654214.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

We have developed patches which relaxes the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) new postgresql.conf item "mbstr_check" added.
2) if mbstr_check = 0 then invalid characters are not accepted
(same as current PostgreSQL behavior). This is the default.
3) if mbstr_check = 1 then invalid characters are accepted with
WARNING
4) if mbstr_check = 2 then invalid characters are accepted without any
warnings
5) We have checked PostgreSQL source code if accepting invalid
characters makes some troubles. We have found that we need to fix a
place and the fix is included in the patches.

Madison,
If you are interested in the patches, I could send it to you.

Hackers,
Do you think the functionality something like above is worth to add to
PostgreSQL?
--
Tatsuo Ishii

> Hi all,
>
> I've been chasing down a bug and from what I have learned it may be
> because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles
> invalid unicode. I've been given some ideas on how to try to catch
> invalid unicode but it seems expensive so I am hoping there is a
> postgresql way to deal with this problem.
>
> I've run into a problem where a bulk postgres "COPY..." statement is
> dieing because one of the lines contains a file name with an invalid
> unicode character. In nautilus this file has '(invalid encoding)' and
> the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column
> file_name: "Femme Fatal\uffff.url"'.
>
> To actually look at the file from the shell (bash) shows what appears
> to be a whitespace but when I copy/paste the file name I get the
> '\uffff' you see above.
>
> I could, with the help of the TLUG people, use regex to match for an
> invalid character and skip the file but that is not ideal. The reason is
> that this is for my backup program and invalid unicode or not, the
> contents of the file may still be important and I would prefer to have
> it in the database so that it is later copied. I can copy and move the
> file in the shell so the file isn't apparently in an of itself corrupt.
>
> So then, is there a way I can tell postresql to accept the invalid
> unicode name? Here is a copy of my schema:
>
> tle-bu=> \d file_info_2
> Table "public.file_info_2"
> Column | Type | Modifiers
> ----------------------+----------------------+-----------------------------------------
> file_group_name | text |
> file_group_uid | bigint | not null
> file_mod_time | bigint | not null
> file_name | text | not null
> file_parent_dir | text | not null
> file_perm | text | not null
> file_size | bigint | not null
> file_type | character varying(2) | not null default
> 'f'::character varying
> file_user_name | text |
> file_user_uid | bigint | not null
> file_backup | boolean | not null default true
> file_display | boolean | not null default false
> file_restore_display | boolean | not null default false
> file_restore | boolean | not null default false
> Indexes:
> "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
>
> 'file_name' and 'file_parent_dir' are the columns that could have
> entries with the invalid unicode characters. Maybe I could/should use
> something other than 'text'? These columns could contain anything that a
> file or directory name could be.
>
> Thanks!
>
> Madison
>
> --
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Madison Kelly (Digimer)
> TLE-BU, The Linux Experience; Back Up
> http://tle-bu.thelinuxexperience.com
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Madison Kelly 2005-05-08 04:02:37 Re: Invalid unicode in COPY problem
Previous Message Madison Kelly 2005-05-08 00:22:53 Invalid unicode in COPY problem

Browse pgsql-hackers by date

  From Date Subject
Next Message Neil Conway 2005-05-08 02:05:57 Re: pl/pgsql enabled by default
Previous Message Mike Mascari 2005-05-08 01:59:39 Re: pl/pgsql enabled by default