Re: Finding Errors in .csv Input Data

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Rich Shepard <rshepard(at)appl-ecosys(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Finding Errors in .csv Input Data
Date: 2011-02-23 01:37:09
Message-ID: 4D6464C5.6040400@squeakycode.net
Lists: pgsql-general

On 02/22/2011 07:25 PM, Andy Colson wrote:
> On 02/22/2011 07:10 PM, Rich Shepard wrote:
>> I'm sure many of you have solved this problem in the past and can offer
>> solutions that will work for me. The context is a 73-column postgres table
>> of data that was originally in an Access .mdb file. A colleague loaded the
>> file into Access and wrote a .csv file for me to use since we have nothing
>> Microsoft here. There are 110,752 rows in the file/table. After a lot of
>> cleaning with emacs and sed, the copy command accepted all but 80 rows of
>> data. Now I need to figure out why postgres reports them as having too many
>> columns.
>>
>> Starting to work with a single row, I first compared my cleaned row to the
>> raw .csv from the Access output. They match column-by-column. Then I copied
>> the schema to a text file and started comparing the .csv data
>> column-by-column. While this looks OK to me, postgres doesn't like it. For
>> example, I get this error message:
>>
>> nevada=# \copy water_well from 'one.csv' with delimiter '|' null '' CSV;
>> ERROR: value too long for type character(1)
>> CONTEXT: COPY water_well, line 1, column gravel_packed: "106"
>>
>> Yet, the column comparison for gravel_packed and surrounding attributes
>> does not show this:
>>
>> lot_no TEXT, |
>> block_no TEXT, |
>> well_finish_date DATE, 11/15/1948|
>> date_cmplt_acc CHAR(1), D|
>> gravel_packed CHAR(1), |
>> depth_seal INTEGER, |
>> depth_drilled INTEGER, 106|
>> depth_bedrock INTEGER, |
>> aquifer_desc TEXT, |
>>
>> Notice the NULL value for gravel_packed, while the "106" is for
>> depth_drilled, two columns later.
>>
>> I'm stymied and wonder if there's a tool I can use to fix these 80 rows so
>> the copy command will accept them.
>>
>> Rich
>>
>
> Can we see a few lines of one.csv? If we ignore the CONTEXT line and just look at the error "too long for char(1)", it looks like there are only two columns to think about: date_cmplt_acc and gravel_packed. Are you sure there are no extra spaces, tabs, or other weirdness in the file for those two columns?
>
> You might also consider dumping out insert statements. Might be a little slower, but simpler to debug.
>
> Did Access dump out comma-separated values with quoted fields? I've done this on several occasions and never had to do any cleaning. But... I also use perl to split the csv and fire off inserts/copys/updates/etc.
>
> -Andy
>

Hmm... one more thought: you could change all your fields to 'text', then do the import, and clean up the data afterwards with a few queries like:

select * from water_well where length(date_cmplt_acc) > 1;

And I see that you have 73 columns, not just the few you posted. Well, I suppose... one of these days, I'll start paying more attention :-)
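For the 80 rejected rows, a quick script that flags lines whose field count doesn't match the 73-column table, or whose single-character fields run long, can narrow things down before re-running \copy. Here's a minimal sketch of that idea in Python (same approach as the perl splitting mentioned above); the pipe delimiter comes from the \copy command in the original post, but the column positions in CHAR1_COLS are placeholders you'd adjust to the real schema:

```python
import csv

EXPECTED_COLS = 73     # width of the water_well table, per the original post
# 0-based positions of the CHAR(1) columns -- these indexes are guesses
# for illustration; set them from the actual table definition
CHAR1_COLS = {39, 40}  # e.g. date_cmplt_acc, gravel_packed

def check_rows(lines, expected=EXPECTED_COLS, char1=CHAR1_COLS):
    """Yield (line_no, problem) for rows that COPY would likely reject."""
    reader = csv.reader(lines, delimiter='|')
    for n, row in enumerate(reader, start=1):
        if len(row) != expected:
            yield n, 'has %d fields, expected %d' % (len(row), expected)
            continue
        for i in char1:
            if len(row[i]) > 1:
                yield n, 'column %d too long for char(1): %r' % (i, row[i])

if __name__ == '__main__':
    with open('one.csv', newline='') as f:
        for line_no, problem in check_rows(f):
            print('line %d: %s' % (line_no, problem))
```

A row with an extra or missing delimiter shifts every later field over by one, which is exactly how "106" from depth_drilled can land in gravel_packed, so a field-count mismatch is the first thing worth checking.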

-Andy
