Re: importing db as text files

From: expect <expect(at)ihubbell(dot)com>
To:
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: importing db as text files
Date: 2003-08-14 04:17:01
Message-ID: 20030813211701.031090a6.expect@ihubbell.com
Lists: pgsql-general

On Wed, 13 Aug 2003 14:59:29 -0700
"Gregory S. Williamson" <gsw(at)globexplorer(dot)com> wrote:

> I tend to use perl to preprocess dumps (in our case from Informix). It tends
> to work much faster than shell scripts (although your mileage may vary). I
> have to fix missing numeric values (if the column allows nulls why can't the
> copy command accept an empty field, I wonder?), missing dates. In other cases

This seems to be the question I need answered as well. Does anyone know the
answer?
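
For what it's worth, COPY's text format does have a null marker: a field written as `\N` loads as NULL. So one workaround is to rewrite empty fields as `\N` before loading. A minimal sed sketch, assuming a pipe-delimited dump with made-up values:

```shell
# Rewrite empty pipe-delimited fields as \N so COPY loads them as NULLs.
# The same substitution runs twice because s///g will not re-match
# overlapping runs of adjacent empty fields (e.g. "a|||b").
printf "D'Adario||42|\n" \
  | sed -e 's/||/|\\N|/g' -e 's/||/|\\N|/g' \
        -e 's/^|/\\N|/' -e 's/|$/|\\N/'
# -> D'Adario|\N|42|\N
```

Alternatively, COPY can be told to treat empty strings themselves as nulls with something along the lines of `copy importtable from stdin with null as '';` - check the COPY reference for your server version before relying on that option.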

> it massages date formats, spatial data, etc. For example, a crude program
> below to clean DOQQ metadata:
>
> Greg W.
> ==================================
> firenze% more infxunl2psql
> #!/usr/dist/bin/perl -w
>
> $FILE   = $ARGV[0];
> $OUTPUT = $ARGV[1];
> $MODE   = $ARGV[2];
>
> open (INFILE, "$FILE")     or die "cannot open $FILE: $!";
> open (OUTFILE, ">$OUTPUT") or die "cannot open $OUTPUT: $!";
>
> foreach $line (<INFILE>)
> {
>     chop($line);   # strip the newline ...
>     chop($line);   # ... and the carriage return (CRLF input)
>     if ($MODE eq "DOQ") {
>         ($t_source_filename, $t_quadrangle_name, $t_west_longitude,
>          $t_east_longitude, $t_north_latitude, $t_south_latitude,
>          $t_production_date, $t_raster_order, $t_band_organization,
>          $t_band_content, $t_bits_per_pixel, $t_samples_and_lines,
>          $t_horizontal_datum, $t_horizontal_coordinate_system,
>          $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution,
>          $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin,
>          $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
>          $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy,
>          $t_secondary_ne_quad_xy, $t_secondary_se_quad_xy,
>          $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
>          $t_source_dem_date, $t_agency, $t_producer, $t_production_system,
>          $t_standard_version, $t_metadata_date, $t_data_file_size,
>          $byte_count) = split(/\|/, $line);
>         if (length($t_production_date) == 0) {
>             $t_production_date = "\\N";   # psql seems to dump a blank date with this nomenclature
>         }
>         if (length($t_coordinate_zone) == 0) {   # an integer
>             $t_coordinate_zone = 0;
>         }
>         if (length($t_band_content) == 0) {
>             $t_band_content = 0;   # also an int
>         }
>         if (length($t_bits_per_pixel) == 0) {
>             $t_bits_per_pixel = 0;   # reasonable default for an int?
>         }
>         if (length($t_horizontal_resolution) == 0) {
>             $t_horizontal_resolution = 0.0;
>         }
>         if (length($t_secondary_horizontal_datum) == 0) {
>             $t_secondary_horizontal_datum = "\'\'";
>         }
>         if (length($t_rmse_xy) == 0) {
>             $t_rmse_xy = 0.0;
>         }
>         if (length($t_metadata_date) == 0) {
>             $t_metadata_date = "\\N";
>         }
>         if (length($t_data_file_size) == 0) {
>             $t_data_file_size = 0;   # a big int
>         }
>         if (length($byte_count) == 0) {
>             $byte_count = 0;   # reasonable default? for an int
>         }
>         $out_line = join("|",
>             $t_source_filename, $t_quadrangle_name, $t_west_longitude,
>             $t_east_longitude, $t_north_latitude, $t_south_latitude,
>             $t_production_date, $t_raster_order, $t_band_organization,
>             $t_band_content, $t_bits_per_pixel, $t_samples_and_lines,
>             $t_horizontal_datum, $t_horizontal_coordinate_system,
>             $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution,
>             $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin,
>             $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
>             $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy,
>             $t_secondary_ne_quad_xy, $t_secondary_se_quad_xy,
>             $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
>             $t_source_dem_date, $t_agency, $t_producer, $t_production_system,
>             $t_standard_version, $t_metadata_date, $t_data_file_size,
>             $byte_count);
>         print OUTFILE "$out_line\n";
>     }
>     else {
>         print OUTFILE "$line\n";
>     }
> }
> close INFILE;
> close OUTFILE;
>
> -----Original Message-----
> From: Jason Godden [mailto:jasongodden(at)optushome(dot)com(dot)au]
> Sent: Wednesday, August 13, 2003 2:35 PM
> To: expect; pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] importing db as text files
>
>
> Hi expect,
>
> Best way in my opinion is to use the copy table command. This way Pg will
> actually 'massage' the data (string escapes and all) for you.
>
> If you have complex files it is best to use a combo of sed/awk to pre-massage
> the field delimiters and import the data into a temporary table. In most
> instances it is best to pipe the data to the psql command using copy table
> from stdin, as copy from file requires that you are the process owner.
>
> Always use a temporary import table and perform validation/further conversion
> in that table (IMO).
>
> eg (from one of my earlier posts)
>
> cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g" \
>   | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"
>
> The first sed will replace all instances of "," with a tab character
> (I achieve tab in my shell scripts with a line like TAB=$(printf "\t"))
>
> The second sed will remove the start and end " because the first only checks
> for ",".
>
> The third sed will remove the Windows carriage returns (if any - depends on
> the platform the file was generated from).
>
> Note here that my file format never contains any " in a field so I can safely
> run the second sed. If your import file contains these then you will need to
> change the regex or use awk.
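
The pipeline above can be tried on a single made-up record without touching the database by dropping the psql stage. One caveat: `s/.$//g` removes the last character whether or not it is a carriage return, so this sketch substitutes `tr -d '\r'`, which only deletes CRs:

```shell
TAB=$(printf '\t')
# Made-up quoted-CSV record with a trailing Windows carriage return:
printf "\"John\",\"D'Adario\",\"42\"\r\n" \
  | sed "s/\",\"/$TAB/g" \
  | sed "s/\"//g" \
  | tr -d '\r'
# prints the three fields separated by literal tab characters
```

If the transformed output looks right, the same pipeline can be pointed at psql's copy-from-stdin as shown above.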
>
> Regards,
>
> Jason
>
> On Thu, 14 Aug 2003 07:14 am, expect wrote:
> > What's the big deal with importing text files? I have a 70 MB file to
> > import and it's been one problem after another. I used the copy command
> > and it appears that it's just not possible. I finally massaged the file
> > into a .sql file and ran that using \i db.sql but that failed too because I
> > overlooked ' in names like D'Adario. The other problem I encountered was
> > that a numeric field had to have data in it, pg would not default to the
> > default value. So instead of massaging all the data again I decided to
> > change the data type for that column. This is my first experience with
> > postgresql and I'm wondering if I should expect to encounter similar pain
> > as I go further into this? So far it's been very painful trying to do what
> > I thought would be easy and what I think should be easy.
> >
> > PostgreSQL 7.3.4 on linux redhat 9
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
>
>
