importing a messy text file

From: Willy-Bas Loos <willybas(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: importing a messy text file
Date: 2014-04-30 08:07:09
Message-ID: CAHnozTiOp45ur4=6kqUuPBadptFjM=hDgGa1fbZUmF1Z-UZijA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I have a 56GB textfile that i want to import into postgres.
The file is tab delimited and not quoted.
I deleted the header with the column names (using sed) so that i could use
COPY with the non-csv text type (because some of the text values contain
quotes).

I had some minor trouble with the file which i managed, but now i have one
where i can't think of a solution, even though it seems so simple.

The problem is this:
There is a tab after the last column, in many but not all records.
When i ran into the extra tab i added a dummy column in the destination
table but now COPY thows an error because the data for the dummy column is
missing (on record ~275K of about 150M).

The file is too big to edit by hand and anyway it would probably not be
feasible to manually add tabs for every record that misses one, although i
don't know how many it would be.

I realize that there could be other showstoppers in the file, like missing
tabs in the middle or extra tabs in the middle, but i would like to try and
get this fixed.

Maybe it would be feasible to add every record as 1 value and then
splitting those into columns using postgres text processing.
Or maybe there is an (undocumented?) option in copy or \copy to ignore
extra columns.
Or maybe there is some no-sql software where i can import this and then
structure the data before i pass it to postgres..

Do you have any tips, please?

Cheers,

--
Willy-Bas Loos

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rory Campbell-Lange 2014-04-30 08:08:01 Re: Security Issues: Allowing Clients to Execute SQL in the Backend.
Previous Message Hello World 2014-04-30 07:49:37 Re: Security Issues: Allowing Clients to Execute SQL in the Backend.