Re: type-casting failures

From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: Natalie Wenz <nataliewenz(at)ebureau(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: type-casting failures
Date: 2013-06-21 17:54:02
Message-ID: 2EA59CD9-8B59-46EF-955D-5223725EDCDA@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Jun 21, 2013, at 19:21, Natalie Wenz <nataliewenz(at)ebureau(dot)com> wrote:

> Hi all!
>
> I am working on migrating data in a very large database (22TB +) between two existing tables that use type text for almost every field, to two new tables that use native data type for dates, integers, etc.
>
> The way I'm doing this is running a series of "Insert into new_table values (stuff)… select stuff::appropriate_cast from old_table". I'm splitting the data up into chunks using my one index on a date field.
>
> The problem I'm running into is I keep hitting data that won't cast, and then the insert fails with a message like this:
>
> ERROR: invalid input syntax for integer: "X"
>
> I have asked the list how to select for things that are not integer in the past, and the suggestions I got were to use regular expressions (which seems like it would not do that well performance-wise, with the amount of data I'm dealing with), or to dump all the data out and write a program to scrub it. I'm not a terribly experienced programmer, and I don't have enough extra disk space to dump all of the data, read it in, write out a new copy, and then import it back into the database. That would put me at about four copies of the data, rather than two.
>
> Has anyone else encountered this difficulty? Does anyone have any ideas?

I've been running into similar issues querying an Oracle database and although I never had a chance to implement it (people didn't dare to modify the database that was for a 3rd party application) I did have an idea at the time.

The idea is to use exception handling to cast what you can to the desired data type. I'm not sure how well exception handling scales to a database the size of yours though, considering it creates temporary sub-transactions each time.

Off the top of my head, implementation in PG would be like this:

CREATE OR REPLACE FUNCTION appropriate_name_here(value text) RETURNS boolean
$$
DECLARE
intvalue int;
BEGIN
FOR SELECT stuff FROM old_table LOOP
BEGIN
intvalue := stuff.charvalue::int;
EXCEPT WHEN <invalid input syntax error code here>
RETURN NEXT false;
END

INSERT INTO new_table VALUES (intvalue);
RETURN NEXT true;

END LOOP;

RETURN;
END
$$

I'm a bit rusty on PG stored procedures, you can probably tell :P

You could probably do all your casts to the appropriate tables in one pass using this approach. You probably also want to add an INSERT into a failures table for when casts fail, so you can look up the trouble-makers easily.

Another option is using an ETL tool to extract the data from your database and load it back in into the appropriate tables. That would work quite similar as the stored procedure approach, but you get more bells and whistles as an advantage.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2013-06-21 17:54:46 Re: File System backup
Previous Message Adrian Klaver 2013-06-21 17:52:55 Re: 9.3-beta postgres-fdw COPY error